Introduction to Data Science Assignment 2023/2024¶
Professors: Pedro G. Ferreira, Alípio Jorge
Group Students: Gonçalo Rocha (up201707455), Manuela Pinheiro (up200400020), Sofia Coelho (up202103646)
import numpy as np
import seaborn as sns
import pylab as pyl
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import tree
from boruta import BorutaPy
from sklearn.svm import SVC
from scipy.stats import zscore
from sklearn.impute import KNNImputer
from mixed_naive_bayes import MixedNB
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
1. Business understanding¶
In this project we follow the CRISP-DM methodology to develop a predictive model for customer churn in the telecommunications industry. The goal is to proactively identify potential churners so that the company can implement targeted retention strategies and reduce customer attrition.
1.1 - The first step is to understand the business problem and define success criteria:
The Business problem: The telecommunications company wants to make better business decisions to decrease the rate at which customers cancel their subscriptions. This rate is critical to the company: first, it reflects customer satisfaction and, second, it has a large impact on the company's revenue. Identifying the factors and patterns that lead to churn is crucial in order to 1) change or optimize those factors and 2) reinforce retention efforts for at-risk clients. Implementing targeted retention efforts for potential churners is expected to be more cost-effective than acquiring new customers.
The Business criteria: Reducing costs by increasing customer retention.
1.2 - Then, we translate it into a machine learning problem:
The Machine Learning problem: A binary classification problem: given a dataset describing each client through several variables, find a classification model that predicts whether the client will cancel their subscription.
The Machine Learning criteria:
- Prediction accuracy: achieve a high accuracy rate in predicting customer churn, with particularly good performance on the class Churn = yes.
- Precision and recall: balance precision and recall to effectively identify potential churners without excessive false positives.
- Model robustness: develop a robust model that generalizes well to new data and changing patterns.
1.3 - The key questions that we want to explore during the project are:
- What factors contribute most to customer churn in the telecommunications industry?
- How well can we predict customer churn from the available data?
- Which features are most indicative of potential churners?
- How can the company effectively allocate resources to retain customers and reduce churn?
2. Data Understanding¶
2.1 Explore the data¶
The first step in data understanding is to explore the structure and format of the data, identify the type of variables and examine the size and shape of the dataset.
data = pd.read_csv("churn_data.csv")
data
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 128.0 | no | yes | 25.0 | 265.1 | 110.0 | 45.07 | 197.4 | 99.0 | 16.78 | 244.7 | 91.0 | 11.01 | 10.0 | 3.0 | 2.70 | 1.0 |
| 1 | No | 107.0 | no | yes | 26.0 | 161.6 | 123.0 | 27.47 | 195.5 | 103.0 | 16.62 | 254.4 | 103.0 | 11.45 | 13.7 | 3.0 | 3.70 | 1.0 |
| 2 | No | 137.0 | no | no | 0.0 | 243.4 | NaN | 41.38 | 121.2 | 110.0 | 10.30 | 162.6 | 104.0 | 7.32 | 12.2 | 5.0 | 3.29 | 0.0 |
| 3 | No | 84.0 | yes | no | 0.0 | 299.4 | 71.0 | 50.90 | 61.9 | 88.0 | 5.26 | 196.9 | 89.0 | 8.86 | 6.6 | 7.0 | 1.78 | 2.0 |
| 4 | No | 75.0 | yes | no | 0.0 | 166.7 | 113.0 | 28.34 | 148.3 | 122.0 | 12.61 | 186.9 | 121.0 | 8.41 | 10.1 | 3.0 | 2.73 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | No | 50.0 | no | yes | 40.0 | 235.7 | 127.0 | 40.07 | 223.0 | 126.0 | 18.96 | 297.5 | 116.0 | 13.39 | 9.9 | 5.0 | 2.67 | 2.0 |
| 4996 | Yes | 152.0 | no | no | 0.0 | 184.2 | 90.0 | 31.31 | 256.8 | 73.0 | 21.83 | 213.6 | 113.0 | 9.61 | 14.7 | 2.0 | 3.97 | 3.0 |
| 4997 | No | 61.0 | no | no | 0.0 | 140.6 | 89.0 | 23.90 | 172.8 | 128.0 | 14.69 | 212.4 | 97.0 | 9.56 | 13.6 | 4.0 | 3.67 | 1.0 |
| 4998 | No | 109.0 | no | no | 0.0 | 188.8 | 67.0 | 32.10 | 171.7 | 92.0 | 14.59 | 224.4 | 89.0 | NaN | 8.5 | 6.0 | 2.30 | 0.0 |
| 4999 | No | 86.0 | no | yes | 34.0 | 129.4 | 102.0 | 22.00 | 267.1 | 104.0 | 22.70 | 154.8 | 100.0 | 6.97 | 9.3 | 16.0 | 2.51 | 0.0 |
5000 rows × 18 columns
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 churn 5000 non-null object 1 accountlength 4951 non-null float64 2 internationalplan 4950 non-null object 3 voicemailplan 4950 non-null object 4 numbervmailmessages 4950 non-null float64 5 totaldayminutes 4950 non-null float64 6 totaldaycalls 4950 non-null float64 7 totaldaycharge 4950 non-null float64 8 totaleveminutes 4950 non-null float64 9 totalevecalls 4950 non-null float64 10 totalevecharge 4950 non-null float64 11 totalnightminutes 4950 non-null float64 12 totalnightcalls 4950 non-null float64 13 totalnightcharge 4950 non-null float64 14 totalintlminutes 4950 non-null float64 15 totalintlcalls 4950 non-null float64 16 totalintlcharge 4950 non-null float64 17 numbercustomerservicecalls 4950 non-null float64 dtypes: float64(15), object(3) memory usage: 703.3+ KB
- Churn: Binary Variable ("No" or "Yes")
- Account Length: Continuous Variable, Ratio-scaled Variable
- International Plan: Binary Variable ("No" or "Yes")
- Voicemail Plan: Binary Variable ("No" or "Yes")
- Number of Voicemail Messages: Discrete Numerical Variable, Ratio-scaled Variable
- Total Day Minutes: Continuous Variable, Ratio-scaled Variable
- Total Day Calls: Discrete Numerical Variable, Ratio-scaled Variable
- Total Day Charge: Continuous Variable, Ratio-scaled Variable
- Total Evening Minutes: Continuous Variable, Ratio-scaled Variable
- Total Evening Calls: Discrete Numerical Variable, Ratio-scaled Variable
- Total Evening Charge: Continuous Variable, Ratio-scaled Variable
- Total Night Minutes: Continuous Variable, Ratio-scaled Variable
- Total Night Calls: Discrete Numerical Variable, Ratio-scaled Variable
- Total Night Charge: Continuous Variable, Ratio-scaled Variable
- Total International Minutes: Continuous Variable, Ratio-scaled Variable
- Total International Calls: Discrete Numerical Variable, Ratio-scaled Variable
- Total International Charge: Continuous Variable, Ratio-scaled Variable
- Number of Customer Service Calls: Discrete Numerical Variable, Ratio-scaled Variable
2.2 Detecting duplicates and missing values¶
print(data.duplicated().value_counts())
False 5000 Name: count, dtype: int64
M = data.isna().sum()
M
churn 0 accountlength 49 internationalplan 50 voicemailplan 50 numbervmailmessages 50 totaldayminutes 50 totaldaycalls 50 totaldaycharge 50 totaleveminutes 50 totalevecalls 50 totalevecharge 50 totalnightminutes 50 totalnightcalls 50 totalnightcharge 50 totalintlminutes 50 totalintlcalls 50 totalintlcharge 50 numbercustomerservicecalls 50 dtype: int64
M.sum()
849
data.count()/data.shape[0]
churn 1.0000 accountlength 0.9902 internationalplan 0.9900 voicemailplan 0.9900 numbervmailmessages 0.9900 totaldayminutes 0.9900 totaldaycalls 0.9900 totaldaycharge 0.9900 totaleveminutes 0.9900 totalevecalls 0.9900 totalevecharge 0.9900 totalnightminutes 0.9900 totalnightcalls 0.9900 totalnightcharge 0.9900 totalintlminutes 0.9900 totalintlcalls 0.9900 totalintlcharge 0.9900 numbercustomerservicecalls 0.9900 dtype: float64
missing_values = data[data.isnull().any(axis=1)]
missing_values
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | No | 137.0 | no | no | 0.0 | 243.4 | NaN | 41.38 | 121.2 | 110.0 | 10.30 | 162.6 | 104.0 | 7.32 | 12.2 | 5.0 | 3.29 | 0.0 |
| 19 | No | 73.0 | no | no | 0.0 | 224.4 | NaN | 38.15 | 159.5 | 88.0 | 13.56 | 192.8 | 74.0 | 8.68 | 13.0 | 2.0 | 3.51 | 1.0 |
| 20 | No | 147.0 | no | no | NaN | 155.1 | 117.0 | 26.37 | 239.7 | 93.0 | 20.37 | 208.8 | 133.0 | 9.40 | 10.6 | 4.0 | 2.86 | 0.0 |
| 24 | No | NaN | no | no | 0.0 | 81.1 | 86.0 | 13.79 | 245.2 | 72.0 | 20.84 | 237.0 | 115.0 | 10.67 | 10.3 | 2.0 | 2.78 | 0.0 |
| 29 | No | NaN | no | no | 0.0 | 119.3 | 117.0 | 20.28 | 215.1 | 109.0 | 18.28 | 178.7 | 90.0 | 8.04 | 11.1 | 1.0 | 3.00 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4980 | Yes | 73.0 | no | no | 0.0 | 177.2 | NaN | 30.12 | 270.5 | 84.0 | 22.99 | 241.8 | 112.0 | 10.88 | 12.3 | 2.0 | 3.32 | 3.0 |
| 4988 | No | NaN | no | no | 0.0 | 157.0 | 101.0 | 26.69 | 208.8 | 127.0 | 17.75 | 113.3 | 109.0 | 5.10 | 16.2 | 2.0 | 4.37 | 2.0 |
| 4992 | No | 83.0 | no | no | NaN | 188.3 | 70.0 | 32.01 | 243.8 | 88.0 | 20.72 | 213.7 | 79.0 | 9.62 | 10.3 | 6.0 | 2.78 | 0.0 |
| 4993 | No | 73.0 | no | no | 0.0 | 177.9 | 89.0 | 30.24 | 131.2 | 82.0 | 11.15 | NaN | 89.0 | 8.38 | 11.5 | 6.0 | 3.11 | 3.0 |
| 4998 | No | 109.0 | no | no | 0.0 | 188.8 | 67.0 | 32.10 | 171.7 | 92.0 | 14.59 | 224.4 | 89.0 | NaN | 8.5 | 6.0 | 2.30 | 0.0 |
793 rows × 18 columns
From this information we can determine the number of missing values in each variable, as well as the number of rows that contain missing values. Since the number of rows containing missing values (793) is smaller than the total count of missing values across all variables (849), some rows must contain more than one missing value.
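This reasoning can be verified directly with pandas. A minimal sketch on a small synthetic frame (the columns `a` and `b` are placeholders for illustration, not dataset columns):

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the churn data
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 1.0, 2.0],
})

total_missing = int(df.isna().sum().sum())            # every NaN cell
rows_with_missing = int(df.isna().any(axis=1).sum())  # rows with >= 1 NaN
# total_missing > rows_with_missing implies some row holds several NaNs
rows_with_multiple = int((df.isna().sum(axis=1) >= 2).sum())
print(total_missing, rows_with_missing, rows_with_multiple)  # 4 3 1
```

The same three expressions applied to `data` reproduce the 849 vs 793 comparison above.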
2.3 Exploring Continuous / Numeric Variables¶
A statistical summary of each variable gives us important information about the spread of the data: the minimum and maximum values (range), the first and third quartiles (25th and 75th percentiles), from which the Interquartile Range (IQR) can be computed, and the mean and median (50th percentile). Some of this information can also be represented with boxplots, a standardized way of displaying a dataset based on its five-number summary (minimum, first quartile, median, third quartile, maximum), which helps us visualize the data distribution.
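The quartile-based quantities mentioned above can be computed directly with pandas; a minimal sketch on an illustrative series:

```python
import pandas as pd

s = pd.Series(range(1, 11))  # illustrative values 1..10
q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 25th and 75th percentiles
iqr = q3 - q1                                # Interquartile Range
print(q1, q3, iqr)  # 3.25 7.75 4.5
```

`data.describe()` below reports the same quantities (count, mean, std, min, quartiles, max) for every numeric column at once.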
data.describe()
| accountlength | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4951.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 | 4950.000000 |
| mean | 100.238295 | 7.763636 | 180.306625 | 100.038788 | 30.629386 | 200.679798 | 100.243838 | 17.048293 | 200.465697 | 99.932929 | 9.015240 | 10.259010 | 4.432525 | 2.772088 | 1.569091 |
| std | 39.718817 | 13.552928 | 53.926625 | 19.844529 | 9.148881 | 50.486434 | 19.837380 | 4.300503 | 50.498290 | 19.939450 | 2.276812 | 2.763712 | 2.448457 | 0.744552 | 1.305279 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 73.000000 | 0.000000 | 143.700000 | 87.000000 | 24.430000 | 166.425000 | 87.000000 | 14.130000 | 167.000000 | 87.000000 | 7.510000 | 8.500000 | 3.000000 | 2.300000 | 1.000000 |
| 50% | 100.000000 | 0.000000 | 180.100000 | 100.000000 | 30.600000 | 201.000000 | 101.000000 | 17.090000 | 200.550000 | 100.000000 | 9.010000 | 10.300000 | 4.000000 | 2.780000 | 1.000000 |
| 75% | 127.000000 | 17.000000 | 216.200000 | 113.000000 | 36.720000 | 234.100000 | 114.000000 | 19.897500 | 234.700000 | 113.000000 | 10.560000 | 12.000000 | 6.000000 | 3.240000 | 2.000000 |
| max | 243.000000 | 52.000000 | 351.500000 | 165.000000 | 59.760000 | 363.700000 | 170.000000 | 30.910000 | 395.000000 | 175.000000 | 17.770000 | 20.000000 | 20.000000 | 5.400000 | 9.000000 |
Similarly to boxplots, histograms and density plots are a way to visually represent our data. Histograms are graphs that show the frequency of numerical data using rectangles: the height of each rectangle represents how many observations fall into that bin. They make it easy to see where the values of a variable occur most frequently.
fig, ax = plt.subplots(3, 5, figsize=(15, 10))
variables = ['accountlength', 'numbervmailmessages', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge']
indice = 0
for i in variables:
data[i].plot.box(ax=ax[0, indice])
data[i].plot.density(ax=ax[1, indice], legend=True)
data[i].plot.hist(ax=ax[2, indice], label=i, bins=10, legend=True)
indice += 1
plt.tight_layout()
fig, ax = plt.subplots(3, 6, figsize=(15, 10))
variables = ['totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes' , 'totalnightcalls', 'totalnightcharge']
indice = 0
for i in variables:
data[i].plot.box(ax=ax[0, indice])
data[i].plot.density(ax=ax[1, indice], legend=True)
data[i].plot.hist(ax=ax[2, indice], label=i, bins=10, legend=True)
indice += 1
plt.tight_layout()
fig, ax = plt.subplots(3, 4, figsize=(15, 10))
variables = ['totalintlminutes','totalintlcalls', 'totalintlcharge', 'numbercustomerservicecalls']
indice = 0
for i in variables:
data[i].plot.box(ax=ax[0, indice])
data[i].plot.density(ax=ax[1, indice], legend=True)
data[i].plot.hist(ax=ax[2, indice], label=i, bins=10, legend=True)
indice += 1
plt.tight_layout()
Most of our variables closely follow a normal distribution; however, some, namely total international calls and number of customer service calls, are right-skewed (positively skewed), with a long tail of higher values. This makes sense, given that these types of calls are not made often. Interestingly, the number of voicemail messages peaks at zero, since most clients have no voicemail messages, while the nonzero counts form a roughly bell-shaped bump with low frequencies at higher values.
Density plots can be seen as an extension of the histogram: they smooth out the distribution of values and reduce noise. They show how the data are spread over the range of values, with the peaks marking where values are concentrated.
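Under the hood a density plot is a kernel density estimate; a small sketch with SciPy's `gaussian_kde` on synthetic data illustrates the smoothing idea:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=20, size=1000)  # synthetic, roughly normal

kde = gaussian_kde(sample)   # Gaussian kernels, bandwidth chosen by Scott's rule
at_centre = kde(100.0)[0]    # estimated density near the centre
in_tail = kde(180.0)[0]      # estimated density far out in the tail
print(at_centre > in_tail)   # the peak sits where the values concentrate
```

pandas' `Series.plot.density`, used in the cells above, computes the same kind of estimate before drawing the curve.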
2.4 Exploring Nominal and Binary Variables¶
The dataset has 3 binary variables, including the target label "churn". To visually analyse their distributions we used bar plots of the value counts.
variables = ['churn', 'internationalplan', 'voicemailplan']
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
titles = ['Churn', 'International Plan', 'Voicemail Plan']
for i, variable in enumerate(variables):
data[variable].value_counts().plot(kind='bar', ax=ax[i])
ax[i].set_title(titles[i])
plt.tight_layout()
From these plots we can see that all three variables lean toward the "No" value. This is particularly interesting for the "churn" variable, since it shows that the majority of clients choose to keep their subscription.
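The imbalance toward "No" can be quantified with normalized value counts. A minimal sketch on hypothetical labels (the 85/15 split below is illustrative, not the dataset's exact proportion):

```python
import pandas as pd

churn = pd.Series(["No"] * 850 + ["Yes"] * 150)  # illustrative 85/15 split
proportions = churn.value_counts(normalize=True)
print(proportions)
# A strong majority class like this is worth noting early: it motivates the
# resampling techniques (SMOTE, random under-sampling) imported above.
```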
2.5 - Detecting Outliers¶
Checking for the presence of outliers in the data is crucial because they can disproportionately influence the fitting of statistical models and lead to overfitting.
2.5.1 - Interquartile Range method¶
To identify potential outliers in our dataset we used the Interquartile Range (IQR) method, since it is a robust method that is less sensitive to extreme values.
Outliers may indicate errors in data collection, data entry, or other data preprocessing steps, but they also may represent interesting patterns or anomalies in the data. The representation of outlier limits in box plots offers a clear and intuitive way to identify, assess, and understand outliers in a dataset. It aids in visually assessing the distribution of data and provides a basis for further exploration and decision-making regarding the treatment of outliers in statistical analyses.
variables = ['accountlength', 'numbervmailmessages', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
'totalnightcharge', 'totalintlminutes', 'totalintlcalls', 'totalintlcharge',
'numbercustomerservicecalls']
num_rows = 3
num_cols = 5
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15, 15))
for i, variable in enumerate(variables):
row = i // num_cols
col = i % num_cols
ax = data[[variable]].boxplot(ax=axes[row, col])
q3 = data[variable].quantile(0.75)
q1 = data[variable].quantile(0.25)
iqr = q3 - q1
topcut = q3 + 1.5 * iqr
lowcut = q1 - 1.5 * iqr
ax.axhline(topcut, c='red', ls='--')
ax.axhline(lowcut, c='orange', ls='--')
plt.tight_layout()
Visually inspecting the above box plots, we observe that the 'account length', 'number of voice mail messages', 'total international calls', and 'number of customer service calls' variables only present outliers on the upper boundary. This could indicate that the data presents a positively skewed distribution, which is in accordance with the previous data distribution plots obtained for the 'number of voice mail messages', 'total international calls', and 'number of customer service calls' variables.
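The positive skew suggested by the upper-boundary outliers can also be checked numerically: a positive `Series.skew` indicates a longer right tail. A minimal sketch on an illustrative count-like series:

```python
import pandas as pd

# Illustrative counts: most values are small, with a few large ones on the right
s = pd.Series([0, 0, 1, 1, 1, 2, 2, 3, 4, 9])
print(s.skew())  # positive => right-skewed (positively skewed)
```

Applying `data[variable].skew()` to the four variables named above would confirm the visual impression from the box plots.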
2.5.2 - Common outliers identification using IQR, Z-score and MAD methods¶
In addition to the IQR method, we also used the Z-score and Median Absolute Deviation (MAD) methods to identify potential outliers.
The Z-score measures how many standard deviations a particular data point is from the mean of the dataset and values with Z-scores greater than a certain threshold (we used 3) are flagged as outliers.
MAD is a measure of statistical dispersion that uses the median of absolute deviations from the median and outliers are determined using a threshold (we used 3).
Similar to the IQR method, MAD is a robust method and less sensitive to extreme values, as they use median-based measures. The Z-score can be influenced by extreme values, especially in datasets with non-normal distributions.
For all the variables that presented a normal distribution we identified the common outliers for the three methods, while for the variables with a skewed distribution we only used the IQR and MAD methods.
variables = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
'totalnightcharge', 'totalintlminutes', 'totalintlcharge']
def detect_outliers_iqr(data):
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
return data[(data < lower_bound) | (data > upper_bound)]
def detect_outliers_mad(data, threshold=3):
median = data.median()
abs_deviations = np.abs(data - median)
mad = abs_deviations.median()
return data[abs_deviations > threshold * mad]
outliers_iqr = {}
outliers_mad = {}
outliers_z = {}
for var in variables:
outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
mean = data[var].mean()
std_dev = data[var].std()
z_scores = (data[var] - mean) / std_dev
outliers_z[var] = set(data[abs(z_scores) > 3].index)
common_outliers = {}
for var in variables:
common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var], outliers_z[var])
for var in variables:
print(f"Common outliers for '{var}':")
print(common_outliers[var])
print(f"Number of common outliers for '{var}': {len(common_outliers[var])}")
print("\n")
Common outliers for 'accountlength':
{416, 1408, 4260, 4389, 4395, 1551, 3216, 817, 1886, 1751, 4379, 4798}
Number of common outliers for 'accountlength': 12
Common outliers for 'totaldayminutes':
{1345, 1986, 2594, 2753, 2252, 365, 1052, 2736, 1397, 4981, 985, 3993, 4892}
Number of common outliers for 'totaldayminutes': 13
Common outliers for 'totaldaycalls':
{1345, 1121, 740, 1989, 4049, 4562, 3187, 692, 468, 1397, 1460, 1144, 4155, 4831}
Number of common outliers for 'totaldaycalls': 14
Common outliers for 'totaldaycharge':
{1345, 1986, 2594, 2753, 2252, 365, 1052, 2736, 1397, 4981, 985, 3993, 4892}
Number of common outliers for 'totaldaycharge': 13
Common outliers for 'totaleveminutes':
{32, 2551, 1960, 3996, 3370, 2732, 1233, 2932, 821, 3989, 4375, 4885, 889, 2331, 4892}
Number of common outliers for 'totaleveminutes': 15
Common outliers for 'totalevecalls':
{960, 3940, 4741, 646, 301, 1615, 3219, 2932, 58}
Number of common outliers for 'totalevecalls': 9
Common outliers for 'totalevecharge':
{32, 2551, 3996, 3370, 2732, 1233, 2932, 821, 533, 4375, 3989, 889, 4885, 2331, 4892}
Number of common outliers for 'totalevecharge': 15
Common outliers for 'totalnightminutes':
{3107, 4707, 1317, 1445, 2663, 1260, 3247, 3376, 2321, 883, 244, 3060, 1238, 4727, 1113, 922, 3899, 3807}
Number of common outliers for 'totalnightminutes': 18
Common outliers for 'totalnightcalls':
{2659, 4707, 3560, 3211, 2988, 493, 3895, 3535, 2288, 4207, 3570, 2903, 3799, 4410, 4508}
Number of common outliers for 'totalnightcalls': 15
Common outliers for 'totalnightcharge':
{3107, 4707, 1317, 1445, 2663, 1260, 3247, 3376, 2321, 883, 244, 3060, 1238, 4727, 1113, 922, 3899, 3807}
Number of common outliers for 'totalnightcharge': 18
Common outliers for 'totalintlminutes':
{1028, 4371, 4504, 1564, 2345, 4396, 2733, 179, 3891, 1080, 2362, 314, 3906, 4935, 712, 3400, 3275, 3403, 3916, 3663, 2513, 595, 4949, 343, 2906, 3290, 4451, 2918, 488, 2669, 878, 3568, 4976, 115, 1400, 762}
Number of common outliers for 'totalintlminutes': 36
Common outliers for 'totalintlcharge':
{1028, 4371, 4504, 1564, 2345, 4396, 2733, 179, 3891, 1080, 2362, 314, 3906, 4935, 712, 3400, 3275, 3403, 3916, 3663, 2513, 595, 4949, 343, 2906, 3290, 4451, 2918, 488, 2669, 878, 3568, 4976, 115, 1400, 762}
Number of common outliers for 'totalintlcharge': 36
variables = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']
# detect_outliers_iqr and detect_outliers_mad are reused from the cell above
outliers_iqr = {}
outliers_mad = {}
for var in variables:
outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
common_outliers = {}
for var in variables:
common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var])
for var in variables:
print(f"Common outliers for '{var}':")
print(common_outliers[var])
print(f"Number of common outliers for '{var}': {len(common_outliers[var])}")
print("\n")
Common outliers for 'numbervmailmessages':
{1797, 1285, 2570, 268, 3342, 4369, 4625, 277, 149, 790, 4502, 2457, 2716, 1441, 4516, 423, 4777, 4778, 1454, 3246, 2608, 4275, 1846, 1596, 2366, 1602, 3523, 1732, 3524, 3910, 71, 2887, 4933, 845, 1487, 2768, 3279, 3154, 599, 3543, 4824, 4059, 3165, 4445, 1378, 4706, 3174, 615, 872, 3433, 4713, 4588, 4717, 3826, 4338, 1908, 1269, 3577, 4859, 2686}
Number of common outliers for 'numbervmailmessages': 60
Common outliers for 'totalintlcalls':
{514, 3587, 2562, 4108, 2576, 22, 3097, 4126, 1567, 4641, 3109, 4134, 41, 4652, 1581, 3636, 3128, 2621, 1092, 4679, 588, 4684, 3664, 4693, 4184, 2156, 4726, 1657, 636, 642, 3206, 2703, 1168, 3727, 153, 1179, 3230, 674, 2212, 2724, 182, 185, 698, 4286, 4292, 211, 723, 2775, 3800, 219, 3291, 3310, 3826, 756, 4857, 250, 3839, 3840, 1797, 2826, 4877, 272, 2835, 4386, 1832, 1333, 4917, 4408, 3386, 2883, 837, 329, 842, 1355, 341, 854, 3418, 347, 4955, 863, 1889, 3938, 1392, 4465, 2930, 883, 377, 3454, 2947, 3973, 4999, 1419, 2956, 911, 3991, 921, 2970, 420, 1960, 3516, 957, 3536, 2001, 3025, 3541, 982, 4568, 474, 2018, 483, 1526, 504, 1021, 3071}
Number of common outliers for 'totalintlcalls': 114
Common outliers for 'numbercustomerservicecalls':
{3585, 1538, 3079, 3081, 522, 3595, 1038, 4110, 4114, 4628, 21, 4631, 542, 2592, 4130, 547, 4132, 4644, 3112, 48, 54, 3131, 3643, 4670, 3140, 3655, 3144, 588, 4687, 3665, 4692, 2646, 1121, 1638, 3692, 1133, 3181, 1142, 3190, 635, 1150, 1662, 4740, 3717, 136, 1673, 655, 3228, 1694, 4772, 1193, 2218, 1707, 2732, 3243, 4778, 2223, 1713, 181, 694, 4790, 2248, 721, 4307, 1241, 4315, 1246, 736, 2785, 2786, 742, 235, 2283, 1273, 4346, 3323, 771, 1284, 778, 1802, 2827, 3851, 2322, 2327, 293, 1831, 1325, 4406, 4919, 1851, 4417, 841, 1865, 332, 2380, 3918, 2387, 3413, 2903, 3421, 2402, 874, 2415, 4982, 1912, 2428, 1919, 1407, 3973, 902, 392, 2952, 2953, 4489, 908, 2444, 2958, 911, 2960, 2961, 3468, 3983, 3487, 2979, 1970, 4018, 1973, 1974, 3509, 4023, 4537, 2493, 3009, 3017, 4044, 974, 3026, 2515, 4562, 3549, 1502, 2031, 4085, 2553, 509}
Number of common outliers for 'numbercustomerservicecalls': 145
2.6 Bivariate Analysis¶
To test variable redundancy, correlation analysis was performed in groups. We grouped the variables by time period and type of communication service, namely:
1- 'totaldayminutes', 'totaldaycalls', 'totaldaycharge';
2- 'totaleveminutes', 'totalevecalls', 'totalevecharge';
3- 'totalnightminutes', 'totalnightcalls', 'totalnightcharge';
4- 'totalintlminutes', 'totalintlcalls', 'totalintlcharge'.
data[['totaldayminutes', 'totaldaycalls', 'totaldaycharge']].corr()
| totaldayminutes | totaldaycalls | totaldaycharge | |
|---|---|---|---|
| totaldayminutes | 1.000000 | 0.003293 | 0.992380 |
| totaldaycalls | 0.003293 | 1.000000 | -0.000558 |
| totaldaycharge | 0.992380 | -0.000558 | 1.000000 |
data[['totaleveminutes', 'totalevecalls', 'totalevecharge']].corr()
| totaleveminutes | totalevecalls | totalevecharge | |
|---|---|---|---|
| totaleveminutes | 1.00000 | 0.004650 | 1.000000 |
| totalevecalls | 0.00465 | 1.000000 | 0.004467 |
| totalevecharge | 1.00000 | 0.004467 | 1.000000 |
data[['totalnightminutes', 'totalnightcalls', 'totalnightcharge']].corr()
| totalnightminutes | totalnightcalls | totalnightcharge | |
|---|---|---|---|
| totalnightminutes | 1.000000 | 0.024878 | 0.999999 |
| totalnightcalls | 0.024878 | 1.000000 | 0.028675 |
| totalnightcharge | 0.999999 | 0.028675 | 1.000000 |
data[['totalintlminutes','totalintlcalls', 'totalintlcharge']].corr()
| totalintlminutes | totalintlcalls | totalintlcharge | |
|---|---|---|---|
| totalintlminutes | 1.000000 | 0.018186 | 0.994521 |
| totalintlcalls | 0.018186 | 1.000000 | 0.022057 |
| totalintlcharge | 0.994521 | 0.022057 | 1.000000 |
fig, axs = plt.subplots(1, 4, figsize=(17, 5))
data.plot.scatter('totaldayminutes', 'totaldaycharge', ax=axs[0])
axs[0].set_title('Total day minutes vs Total day charge')
data.plot.scatter('totaleveminutes', 'totalevecharge', ax=axs[1])
axs[1].set_title('Total evening minutes vs Total evening charge')
data.plot.scatter('totalnightminutes', 'totalnightcharge', ax=axs[2])
axs[2].set_title('Total night minutes vs Total night charge')
data.plot.scatter('totalintlminutes', 'totalintlcharge', ax=axs[3])
axs[3].set_title('Total international minutes vs Total international charge')
plt.tight_layout()
plt.show()
A high correlation was observed between the variables 'total count of minutes' and 'total charge' within each group, suggesting a strong linear relationship between these variables.
These results indicate that one of these variables might be redundant and could potentially be removed from the dataset without significantly affecting the information conveyed by the variables.
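Since charge looks like minutes times a fixed rate, the near-perfect correlation, and the case for dropping one of the pair, can be sketched on synthetic data (the column names mirror the dataset; the 0.17 rate is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
minutes = rng.uniform(0, 300, size=200)
df = pd.DataFrame({
    "totaldayminutes": minutes,
    "totaldaycharge": minutes * 0.17,  # hypothetical flat per-minute rate
})

corr = df["totaldayminutes"].corr(df["totaldaycharge"])  # essentially 1.0
reduced = df.drop(columns=["totaldaycharge"])  # keep one of the redundant pair
print(round(corr, 6), list(reduced.columns))
```

Dropping one column of each minutes/charge pair would shrink the feature space without losing information, exactly as the correlation matrices above suggest.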
3 - Data Preparation¶
3.1 - Encoding categorical variables¶
The features "churn", "internationalplan" and "voicemailplan", which contain categorical values, were converted to a numeric type so that they can be fed to models that only accept numeric data. Observations with the value "No" were mapped to 0 and those with "Yes" to 1.
columns_to_encode = ['churn', 'internationalplan', 'voicemailplan']
label_encoder = LabelEncoder()
for col in columns_to_encode:
data[col] = label_encoder.fit_transform(data[col])
data
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 128.0 | 0 | 1 | 25.0 | 265.1 | 110.0 | 45.07 | 197.4 | 99.0 | 16.78 | 244.7 | 91.0 | 11.01 | 10.0 | 3.0 | 2.70 | 1.0 |
| 1 | 0 | 107.0 | 0 | 1 | 26.0 | 161.6 | 123.0 | 27.47 | 195.5 | 103.0 | 16.62 | 254.4 | 103.0 | 11.45 | 13.7 | 3.0 | 3.70 | 1.0 |
| 2 | 0 | 137.0 | 0 | 0 | 0.0 | 243.4 | NaN | 41.38 | 121.2 | 110.0 | 10.30 | 162.6 | 104.0 | 7.32 | 12.2 | 5.0 | 3.29 | 0.0 |
| 3 | 0 | 84.0 | 1 | 0 | 0.0 | 299.4 | 71.0 | 50.90 | 61.9 | 88.0 | 5.26 | 196.9 | 89.0 | 8.86 | 6.6 | 7.0 | 1.78 | 2.0 |
| 4 | 0 | 75.0 | 1 | 0 | 0.0 | 166.7 | 113.0 | 28.34 | 148.3 | 122.0 | 12.61 | 186.9 | 121.0 | 8.41 | 10.1 | 3.0 | 2.73 | 3.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 0 | 50.0 | 0 | 1 | 40.0 | 235.7 | 127.0 | 40.07 | 223.0 | 126.0 | 18.96 | 297.5 | 116.0 | 13.39 | 9.9 | 5.0 | 2.67 | 2.0 |
| 4996 | 1 | 152.0 | 0 | 0 | 0.0 | 184.2 | 90.0 | 31.31 | 256.8 | 73.0 | 21.83 | 213.6 | 113.0 | 9.61 | 14.7 | 2.0 | 3.97 | 3.0 |
| 4997 | 0 | 61.0 | 0 | 0 | 0.0 | 140.6 | 89.0 | 23.90 | 172.8 | 128.0 | 14.69 | 212.4 | 97.0 | 9.56 | 13.6 | 4.0 | 3.67 | 1.0 |
| 4998 | 0 | 109.0 | 0 | 0 | 0.0 | 188.8 | 67.0 | 32.10 | 171.7 | 92.0 | 14.59 | 224.4 | 89.0 | NaN | 8.5 | 6.0 | 2.30 | 0.0 |
| 4999 | 0 | 86.0 | 0 | 1 | 34.0 | 129.4 | 102.0 | 22.00 | 267.1 | 104.0 | 22.70 | 154.8 | 100.0 | 6.97 | 9.3 | 16.0 | 2.51 | 0.0 |
5000 rows × 18 columns
3.2 - Data Cleaning: missing values¶
Handling missing values in a dataset is crucial to ensure the integrity of the analysis.
Row removal¶
nan_counts = data.isnull().sum(axis=1)
for row, count in nan_counts[nan_counts >= 2].items():
    print(f"Row {row} has {count} NaN values")
Row 233 has 2 NaN values
Row 376 has 2 NaN values
Row 485 has 2 NaN values
Row 517 has 2 NaN values
Row 546 has 2 NaN values
Row 592 has 2 NaN values
Row 594 has 2 NaN values
Row 717 has 2 NaN values
Row 748 has 2 NaN values
Row 1082 has 2 NaN values
Row 1091 has 2 NaN values
Row 1178 has 2 NaN values
Row 1351 has 2 NaN values
Row 1437 has 2 NaN values
Row 1737 has 2 NaN values
Row 1839 has 2 NaN values
Row 1984 has 2 NaN values
Row 2095 has 2 NaN values
Row 2107 has 2 NaN values
Row 2246 has 2 NaN values
Row 2324 has 2 NaN values
Row 2524 has 2 NaN values
Row 2754 has 2 NaN values
Row 2762 has 2 NaN values
Row 2944 has 2 NaN values
Row 2947 has 2 NaN values
Row 3154 has 2 NaN values
Row 3262 has 2 NaN values
Row 3633 has 2 NaN values
Row 3686 has 3 NaN values
Row 3752 has 2 NaN values
Row 3791 has 3 NaN values
Row 3937 has 2 NaN values
Row 3967 has 2 NaN values
Row 4004 has 2 NaN values
Row 4071 has 2 NaN values
Row 4081 has 2 NaN values
Row 4147 has 2 NaN values
Row 4215 has 2 NaN values
Row 4396 has 2 NaN values
Row 4612 has 2 NaN values
Row 4727 has 2 NaN values
Row 4731 has 2 NaN values
As seen above, the rows with more than one missing value contain at most three. Since there are 17 predictor variables (besides our target variable, churn), we consider this proportion too low to justify removing any of these rows.
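For reference, had removal been warranted, rows exceeding a missing-value threshold can be dropped in a single call with pandas' dropna(thresh=...); a minimal sketch on a toy frame (in the notebook, `data` has 18 columns):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0],
                   "c": [7.0, np.nan, 9.0]})

# Keep only rows with at most 1 missing value,
# i.e. with at least (number of columns - 1) non-NaN entries.
cleaned = df.dropna(thresh=df.shape[1] - 1)
print(cleaned.index.tolist())  # → [0, 2]
```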
Imputation¶
One strategy to address missing data involves imputation, where missing values are filled in.
In this dataset, we applied the following imputation methods:
1 - Mean Imputation, where missing values are replaced by the mean of the observed values of the variable. It is simple and fast, but it may distort the true distribution if many values are missing or if they are missing in a pattern.
2 - Median Imputation, where missing values are replaced by the median of the observed values. Being less sensitive to extreme values than the mean, it is a good alternative for skewed distributions or datasets with outliers.
3 - Linear Interpolation, where each missing value is estimated from the neighbouring observed values, assuming the variable changes linearly between them. This is what pandas' interpolate(method='linear') does below; no regression model is fitted. It is effective when values vary smoothly along the index.
4 - KNN Imputation, where each missing value is estimated from the values of the k nearest neighbours, i.e. the data points most similar on the other features. It is effective when the missingness follows a complex pattern or the features have non-linear relationships, since multiple features and their relationships are taken into account.
variables_to_fill = ['totaldaycalls', 'totalevecalls', 'totalnightcalls','totalintlcalls']
data_mean_imputed = data.copy()
data_median_imputed = data.copy()
for col in variables_to_fill:
    data_mean_imputed[col] = data_mean_imputed[col].fillna(data[col].mean())
    data_median_imputed[col] = data_median_imputed[col].fillna(data[col].median())
data_interpolated = data.copy()
data_interpolated[variables_to_fill] = data_interpolated[variables_to_fill].interpolate(method='linear')
imputer = KNNImputer(n_neighbors=5)
data_knn_imputed = data.copy()
data_knn_imputed[variables_to_fill] = imputer.fit_transform(data_knn_imputed[variables_to_fill])
mean_imputed_stats = data_mean_imputed[variables_to_fill].describe()
median_imputed_stats = data_median_imputed[variables_to_fill].describe()
interpolated_stats = data_interpolated[variables_to_fill].describe()
knn_imputed_stats = data_knn_imputed[variables_to_fill].describe()
without_imputation = data[['totaldaycalls', 'totalevecalls','totalnightcalls','totalintlcalls']].describe()
print("Statistics without Imputation:")
print(without_imputation)
print("\nStatistics for Mean Imputation:")
print(mean_imputed_stats)
print("\nStatistics for Median Imputation:")
print(median_imputed_stats)
print("\nStatistics for Linear Interpolation:")
print(interpolated_stats)
print("\nStatistics for KNN Imputation:")
print(knn_imputed_stats)
Statistics without Imputation:
totaldaycalls totalevecalls totalnightcalls totalintlcalls
count 4950.000000 4950.000000 4950.000000 4950.000000
mean 100.038788 100.243838 99.932929 4.432525
std 19.844529 19.837380 19.939450 2.448457
min 0.000000 0.000000 0.000000 0.000000
25% 87.000000 87.000000 87.000000 3.000000
50% 100.000000 101.000000 100.000000 4.000000
75% 113.000000 114.000000 113.000000 6.000000
max 165.000000 170.000000 175.000000 20.000000
Statistics for Mean Imputation:
totaldaycalls totalevecalls totalnightcalls totalintlcalls
count 5000.000000 5000.000000 5000.000000 5000.000000
mean 100.038788 100.243838 99.932929 4.432525
std 19.745037 19.737924 19.839483 2.436181
min 0.000000 0.000000 0.000000 0.000000
25% 87.000000 87.000000 87.000000 3.000000
50% 100.000000 100.243838 100.000000 4.000000
75% 113.000000 113.000000 113.000000 6.000000
max 165.000000 170.000000 175.000000 20.000000
Statistics for Median Imputation:
totaldaycalls totalevecalls totalnightcalls totalintlcalls
count 5000.000000 5000.000000 5000.000000 5000.000000
mean 100.038400 100.251400 99.933600 4.428200
std 19.745037 19.738068 19.839484 2.436562
min 0.000000 0.000000 0.000000 0.000000
25% 87.000000 87.000000 87.000000 3.000000
50% 100.000000 101.000000 100.000000 4.000000
75% 113.000000 113.000000 113.000000 6.000000
max 165.000000 170.000000 175.000000 20.000000
Statistics for Linear Interpolation:
totaldaycalls totalevecalls totalnightcalls totalintlcalls
count 5000.000000 5000.000000 5000.000000 5000.000000
mean 100.021300 100.240400 99.940700 4.433500
std 19.797851 19.774226 19.904891 2.443263
min 0.000000 0.000000 0.000000 0.000000
25% 87.000000 87.000000 87.000000 3.000000
50% 100.000000 100.500000 100.000000 4.000000
75% 113.000000 114.000000 113.000000 6.000000
max 165.000000 170.000000 175.000000 20.000000
Statistics for KNN Imputation:
totaldaycalls totalevecalls totalnightcalls totalintlcalls
count 5000.00000 5000.000000 5000.000000 5000.00000
mean 100.02932 100.246000 99.939800 4.43208
std 19.76802 19.759962 19.855006 2.43898
min 0.00000 0.000000 0.000000 0.00000
25% 87.00000 87.000000 87.000000 3.00000
50% 100.00000 100.000000 100.000000 4.00000
75% 113.00000 113.050000 113.000000 6.00000
max 165.00000 170.000000 175.000000 20.00000
for col in variables_to_fill:
    plt.figure(figsize=(14, 4))
    plt.subplot(1, 5, 1)
    sns.histplot(data[col], bins=20, kde=True, color='purple', label='Without Imputation')
    plt.legend()
    plt.subplot(1, 5, 2)
    sns.histplot(data_mean_imputed[col], bins=20, kde=True, color='blue', label='Mean Imputation')
    plt.legend()
    plt.subplot(1, 5, 3)
    sns.histplot(data_median_imputed[col], bins=20, kde=True, color='orange', label='Median Imputation')
    plt.legend()
    plt.subplot(1, 5, 4)
    sns.histplot(data_interpolated[col], bins=20, kde=True, color='green', label='Linear Interpolation')
    plt.legend()
    plt.subplot(1, 5, 5)
    sns.histplot(data_knn_imputed[col], bins=20, kde=True, color='red', label='KNN Imputation')
    plt.legend()
    plt.tight_layout()
From the obtained statistics and a visual inspection of the histograms above, the variations between the dataset before and after the different imputation methods are minor: none of the methods significantly altered the statistical properties of the data.
This indicates that all four imputation methods preserved the dataset's integrity, estimating missing values in ways consistent with the data's inherent statistical characteristics.
Considering that KNN imputation can handle both continuous and categorical variables and requires no assumptions about the distribution of the data, we decided to use this method to fill the missing values in our dataset.
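As an additional sanity check (not performed in the notebook above), a two-sample Kolmogorov–Smirnov test can quantify whether an imputed column still matches the observed distribution; a sketch on synthetic data standing in for one of our call-count variables:

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic column resembling a call-count variable, with 1% missing values.
col = pd.Series(rng.normal(100, 20, size=5000))
col_missing = col.copy()
col_missing.iloc[rng.choice(5000, size=50, replace=False)] = np.nan

# Median-impute, then compare imputed column vs. the originally observed values.
imputed = col_missing.fillna(col_missing.median())
stat, p = ks_2samp(col_missing.dropna(), imputed)

# A large p-value means no evidence that imputation distorted the distribution.
print(p > 0.05)
```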
KNN Imputation¶
imputer = KNNImputer(n_neighbors=5)
data_imputed = imputer.fit_transform(data)
data = pd.DataFrame(data_imputed, columns=data.columns)
imputed_stats = data.describe()
print("\nStatistics for KNN Imputation:")
print(imputed_stats)
Statistics for KNN Imputation:
churn accountlength internationalplan voicemailplan \
count 5000.000000 5000.000000 5000.000000 5000.000000
mean 0.141400 100.226784 0.113800 0.282400
std 0.348469 39.554740 0.347669 0.471905
min 0.000000 1.000000 0.000000 0.000000
25% 0.000000 73.000000 0.000000 0.000000
50% 0.000000 100.000000 0.000000 0.000000
75% 0.000000 127.000000 0.000000 1.000000
max 1.000000 243.000000 2.000000 2.000000
numbervmailmessages totaldayminutes totaldaycalls totaldaycharge \
count 5000.000000 5000.000000 5000.000000 5000.000000
mean 7.760680 180.303714 100.035360 30.638822
std 13.499377 53.725615 19.766022 9.148733
min 0.000000 0.000000 0.000000 0.000000
25% 0.000000 143.900000 87.000000 24.430000
50% 0.000000 180.050000 100.000000 30.600000
75% 16.000000 216.000000 113.000000 36.750000
max 52.000000 351.500000 165.000000 59.760000
totaleveminutes totalevecalls totalevecharge totalnightminutes \
count 5000.000000 5000.000000 5000.000000 5000.000000
mean 200.634184 100.250200 17.054007 200.468532
std 50.291279 19.759087 4.293597 50.276816
min 0.000000 0.000000 0.000000 0.000000
25% 166.800000 87.000000 14.140000 167.300000
50% 201.000000 101.000000 17.090000 200.500000
75% 233.800000 113.000000 19.900000 234.125000
max 363.700000 170.000000 30.910000 395.000000
totalnightcalls totalnightcharge totalintlminutes totalintlcalls \
count 5000.00000 5000.000000 5000.000000 5000.00000
mean 99.92300 9.017016 10.262268 4.42888
std 19.85791 2.271356 2.752844 2.43862
min 0.00000 0.000000 0.000000 0.00000
25% 87.00000 7.510000 8.500000 3.00000
50% 100.00000 9.020000 10.300000 4.00000
75% 113.00000 10.557000 12.000000 6.00000
max 175.00000 17.770000 20.000000 20.00000
totalintlcharge numbercustomerservicecalls
count 5000.000000 5000.00000
mean 2.771706 1.56956
std 0.741653 1.29969
min 0.000000 0.00000
25% 2.309000 1.00000
50% 2.780000 1.00000
75% 3.240000 2.00000
max 5.400000 9.00000
3.3 - Data Cleaning: outliers¶
On a copy of the dataset, we removed the outliers flagged in common by the IQR, Z-score and MAD methods.
data_out_rem = data.copy()
variables = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes',
'totalnightcalls', 'totalnightcharge', 'totalintlminutes', 'totalintlcharge']
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data < lower_bound) | (data > upper_bound)]

def detect_outliers_mad(data, threshold=3):
    median = data.median()
    abs_deviations = np.abs(data - median)
    mad = abs_deviations.median()
    return data[abs_deviations > threshold * mad]
outliers_iqr = {}
outliers_mad = {}
outliers_z = {}
for var in variables:
    outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
    outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
    mean = data[var].mean()
    std_dev = data[var].std()
    z_scores = (data[var] - mean) / std_dev
    outliers_z[var] = set(data[abs(z_scores) > 3].index)
common_outliers = {}
for var in variables:
    common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var], outliers_z[var])
common_indices = set.intersection(*common_outliers.values())
data_out_rem = data_out_rem.drop(common_indices)
for var in variables:
    common_var_outliers = common_outliers[var].intersection(data_out_rem.index)
    print(f"Number of common outliers removed for '{var}': {len(common_var_outliers)}")
stats_after = {}
for var in variables:
    stats_after[var] = data_out_rem[~data_out_rem.index.isin(common_outliers[var])][var].describe()
print("\nStatistics After Removing Outliers:")
for var, stats in stats_after.items():
    print(f"\nStatistics for '{var}':")
    print(stats)
Number of common outliers removed for 'accountlength': 12
Number of common outliers removed for 'totaldayminutes': 14
Number of common outliers removed for 'totaldaycalls': 14
Number of common outliers removed for 'totaldaycharge': 13
Number of common outliers removed for 'totaleveminutes': 17
Number of common outliers removed for 'totalevecalls': 9
Number of common outliers removed for 'totalevecharge': 15
Number of common outliers removed for 'totalnightminutes': 18
Number of common outliers removed for 'totalnightcalls': 15
Number of common outliers removed for 'totalnightcharge': 18
Number of common outliers removed for 'totalintlminutes': 39
Number of common outliers removed for 'totalintlcharge': 39

Statistics After Removing Outliers:

| variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| accountlength | 4988 | 99.918388 | 39.097257 | 1.0 | 73.0 | 100.0 | 126.0 | 217.0 |
| totaldayminutes | 4986 | 180.514054 | 53.030303 | 19.5 | 144.1 | 180.1 | 216.0 | 338.4 |
| totaldaycalls | 4986 | 100.127517 | 19.447740 | 42.0 | 87.0 | 100.0 | 113.0 | 158.0 |
| totaldaycharge | 4987 | 30.669035 | 9.038280 | 3.21 | 24.45 | 30.60 | 36.75 | 57.53 |
| totaleveminutes | 4983 | 200.806125 | 49.487535 | 52.9 | 167.0 | 201.0 | 233.8 | 350.9 |
| totalevecalls | 4991 | 100.271889 | 19.531729 | 42.0 | 87.0 | 101.0 | 113.0 | 159.0 |
| totalevecharge | 4985 | 17.073894 | 4.232246 | 4.18 | 14.17 | 17.09 | 19.90 | 29.93 |
| totalnightminutes | 4982 | 200.266792 | 49.353499 | 50.1 | 167.325 | 200.4 | 233.7 | 350.2 |
| totalnightcalls | 4985 | 99.917553 | 19.517249 | 41.0 | 87.0 | 100.0 | 113.0 | 159.0 |
| totalnightcharge | 4982 | 9.007922 | 2.229986 | 2.25 | 7.52 | 9.01 | 10.5375 | 15.76 |
| totalintlminutes | 4961 | 10.309825 | 2.623528 | 2.1 | 8.6 | 10.4 | 12.0 | 18.5 |
| totalintlcharge | 4961 | 2.783654 | 0.706657 | 0.57 | 2.32 | 2.81 | 3.24 | 4.97 |
data_out_rem = data.copy()
variables = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data < lower_bound) | (data > upper_bound)]

def detect_outliers_mad(data, threshold=4):
    median = data.median()
    abs_deviations = np.abs(data - median)
    mad = abs_deviations.median()
    return data[abs_deviations > threshold * mad]
outliers_iqr = {}
outliers_mad = {}
for var in variables:
    outliers_iqr[var] = set(detect_outliers_iqr(data[var]).index)
    outliers_mad[var] = set(detect_outliers_mad(data[var]).index)
common_outliers = {}
for var in variables:
    common_outliers[var] = outliers_iqr[var].intersection(outliers_mad[var])
common_indices = set.intersection(*common_outliers.values())
data_out_rem = data_out_rem.drop(common_indices)
for var in variables:
    common_var_outliers = common_outliers[var].intersection(data_out_rem.index)
    print(f"Number of common outliers removed for '{var}': {len(common_var_outliers)}")
stats_after = {}
for var in variables:
    stats_after[var] = data_out_rem[~data_out_rem.index.isin(common_outliers[var])][var].describe()
print("\nStatistics After Removing Outliers:")
for var, stats in stats_after.items():
    print(f"\nStatistics for '{var}':")
    print(stats)
Number of common outliers removed for 'numbervmailmessages': 99
Number of common outliers removed for 'totalintlcalls': 114
Number of common outliers removed for 'numbercustomerservicecalls': 51

Statistics After Removing Outliers:

| variable | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| numbervmailmessages | 4901 | 7.031504 | 12.606129 | 0.0 | 0.0 | 0.0 | 8.8 | 40.0 |
| totalintlcalls | 4886 | 4.232788 | 2.070127 | 0.0 | 3.0 | 4.0 | 5.0 | 10.0 |
| numbercustomerservicecalls | 4949 | 1.519256 | 1.205275 | 0.0 | 1.0 | 1.0 | 2.0 | 5.0 |
Given the high number of rows that would be discarded, we decided not to remove outliers from the original dataset, so as not to compromise its integrity.
data.describe()
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.00000 |
| mean | 0.141400 | 100.226784 | 0.113800 | 0.282400 | 7.760680 | 180.303714 | 100.035360 | 30.638822 | 200.634184 | 100.250200 | 17.054007 | 200.468532 | 99.92300 | 9.017016 | 10.262268 | 4.42888 | 2.771706 | 1.56956 |
| std | 0.348469 | 39.554740 | 0.347669 | 0.471905 | 13.499377 | 53.725615 | 19.766022 | 9.148733 | 50.291279 | 19.759087 | 4.293597 | 50.276816 | 19.85791 | 2.271356 | 2.752844 | 2.43862 | 0.741653 | 1.29969 |
| min | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.00000 |
| 25% | 0.000000 | 73.000000 | 0.000000 | 0.000000 | 0.000000 | 143.900000 | 87.000000 | 24.430000 | 166.800000 | 87.000000 | 14.140000 | 167.300000 | 87.00000 | 7.510000 | 8.500000 | 3.00000 | 2.309000 | 1.00000 |
| 50% | 0.000000 | 100.000000 | 0.000000 | 0.000000 | 0.000000 | 180.050000 | 100.000000 | 30.600000 | 201.000000 | 101.000000 | 17.090000 | 200.500000 | 100.00000 | 9.020000 | 10.300000 | 4.00000 | 2.780000 | 1.00000 |
| 75% | 0.000000 | 127.000000 | 0.000000 | 1.000000 | 16.000000 | 216.000000 | 113.000000 | 36.750000 | 233.800000 | 113.000000 | 19.900000 | 234.125000 | 113.00000 | 10.557000 | 12.000000 | 6.00000 | 3.240000 | 2.00000 |
| max | 1.000000 | 243.000000 | 2.000000 | 2.000000 | 52.000000 | 351.500000 | 165.000000 | 59.760000 | 363.700000 | 170.000000 | 30.910000 | 395.000000 | 175.00000 | 17.770000 | 20.000000 | 20.00000 | 5.400000 | 9.00000 |
3.4 - Data Reduction: Feature Selection¶
While there are seventeen features available to predict the target, analyzing all the data objects can be a time-consuming and expensive process. It is also important to ensure that the analysis is focused on the most relevant information and to measure the impact of considering different sets of features.
Our approach included:
1) Identifying redundant features through a Correlation Matrix: a commonly used method in which the correlation of each pair of numerical variables is computed. With it, we can filter out variables with low correlation to the target and remove redundant variables.
2) Applying ensemble methods: we used the Boruta method for feature ranking based on a random forest. The Boruta algorithm is a wrapper built around random forest classification that gives a numerical estimate of each feature's importance.
3) Recursive Feature Elimination (RFE): this wrapper method iteratively removes the least significant attributes until a final set of the most important variables is achieved.
Note: We considered using Principal Component Analysis (PCA), but ultimately decided against it because PCA is primarily designed for numerical features and may not be directly applicable to categorical or discrete ones. PCA identifies linear combinations of features that capture the maximum variance in the data, and while it is technically possible to apply it to categorical variables (by one-hot encoding them), that is not an optimal approach: the encoding introduces artificial relationships between categories that may not reflect the true underlying structure of the data.
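For illustration only (PCA is not part of our pipeline), a minimal sketch of what PCA does on standardized numerical features: a synthetic 'minutes'/'charge' pair, mimicking the near-duplicate columns in our data, collapses onto a single component, so the explained variance concentrates in the first two of three components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the numeric columns: a highly correlated
# minutes/charge pair plus one independent feature.
minutes = rng.normal(180, 54, size=1000)
charge = minutes * 0.17 + rng.normal(0, 0.1, size=1000)  # charge ≈ rate × minutes
calls = rng.normal(100, 20, size=1000)
X = np.column_stack([minutes, charge, calls])

pca = PCA().fit(StandardScaler().fit_transform(X))
# The third component carries almost no variance.
print(pca.explained_variance_ratio_.round(3))
```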
correlation_matrix = data.corr(numeric_only = True)
correlation_matrix
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| churn | 1.000000 | 0.017599 | 0.223803 | -0.096899 | -0.098410 | 0.204437 | 0.015782 | 0.205893 | 0.089790 | -0.007707 | 0.089355 | 0.043279 | -0.007284 | 0.045688 | 0.061712 | -0.044919 | 0.059816 | 0.216391 |
| accountlength | 0.017599 | 1.000000 | 0.004607 | -0.020104 | -0.017260 | -0.001634 | 0.026951 | -0.000937 | -0.007911 | 0.011359 | -0.008081 | 0.003669 | -0.014938 | 0.003219 | 0.000348 | 0.013088 | -0.000795 | -0.001303 |
| internationalplan | 0.223803 | 0.004607 | 1.000000 | -0.002055 | 0.006188 | 0.022611 | 0.000579 | 0.023479 | 0.032190 | -0.018647 | 0.032162 | -0.007305 | 0.016493 | -0.009109 | 0.022755 | 0.002776 | 0.021917 | -0.015707 |
| voicemailplan | -0.096899 | -0.020104 | -0.002055 | 1.000000 | 0.876245 | -0.000694 | 0.006088 | -0.000403 | 0.030899 | -0.003010 | 0.030819 | 0.004525 | 0.010740 | 0.004073 | -0.002994 | -0.000448 | -0.003718 | -0.005681 |
| numbervmailmessages | -0.098410 | -0.017260 | 0.006188 | 0.876245 | 1.000000 | 0.007037 | 0.003585 | 0.007447 | 0.020998 | -0.007010 | 0.021033 | 0.005377 | 0.000477 | 0.005103 | 0.005578 | -0.004427 | 0.003519 | -0.009512 |
| totaldayminutes | 0.204437 | -0.001634 | 0.022611 | -0.000694 | 0.007037 | 1.000000 | 0.002713 | 0.989143 | -0.010498 | 0.008244 | -0.011064 | 0.009606 | 0.004540 | 0.009887 | -0.021864 | 0.002713 | -0.017980 | -0.004387 |
| totaldaycalls | 0.015782 | 0.026951 | 0.000579 | 0.006088 | 0.003585 | 0.002713 | 1.000000 | 0.000978 | -0.000310 | 0.002941 | 0.000457 | 0.005971 | -0.005723 | 0.004675 | 0.015280 | 0.011587 | 0.012926 | -0.008329 |
| totaldaycharge | 0.205893 | -0.000937 | 0.023479 | -0.000403 | 0.007447 | 0.989143 | 0.000978 | 1.000000 | -0.008848 | 0.008896 | -0.009529 | 0.011760 | 0.005581 | 0.012025 | -0.022457 | 0.000308 | -0.018685 | -0.000983 |
| totaleveminutes | 0.089790 | -0.007911 | 0.032190 | 0.030899 | 0.020998 | -0.010498 | -0.000310 | -0.008848 | 1.000000 | 0.004539 | 0.993895 | -0.017636 | 0.013641 | -0.017142 | 0.000512 | 0.010567 | -0.005727 | -0.012021 |
| totalevecalls | -0.007707 | 0.011359 | -0.018647 | -0.003010 | -0.007010 | 0.008244 | 0.002941 | 0.008896 | 0.004539 | 1.000000 | 0.003882 | 0.003286 | -0.013501 | 0.002488 | -0.006834 | 0.001308 | -0.005892 | 0.005579 |
| totalevecharge | 0.089355 | -0.008081 | 0.032162 | 0.030819 | 0.021033 | -0.011064 | 0.000457 | -0.009529 | 0.993895 | 0.003882 | 1.000000 | -0.017165 | 0.011710 | -0.016676 | 0.001482 | 0.010663 | -0.004732 | -0.012234 |
| totalnightminutes | 0.043279 | 0.003669 | -0.007305 | 0.004525 | 0.005377 | 0.009606 | 0.005971 | 0.011760 | -0.017636 | 0.003286 | -0.017165 | 1.000000 | 0.025320 | 0.993036 | -0.006254 | -0.019283 | -0.005600 | -0.006955 |
| totalnightcalls | -0.007284 | -0.014938 | 0.016493 | 0.010740 | 0.000477 | 0.004540 | -0.005723 | 0.005581 | 0.013641 | -0.013501 | 0.011710 | 0.025320 | 1.000000 | 0.027563 | 0.000101 | -0.001553 | 0.002806 | -0.006382 |
| totalnightcharge | 0.045688 | 0.003219 | -0.009109 | 0.004073 | 0.005103 | 0.009887 | 0.004675 | 0.012025 | -0.017142 | 0.002488 | -0.016676 | 0.993036 | 0.027563 | 1.000000 | -0.007700 | -0.018429 | -0.007068 | -0.008293 |
| totalintlminutes | 0.061712 | 0.000348 | 0.022755 | -0.002994 | 0.005578 | -0.021864 | 0.015280 | -0.022457 | 0.000512 | -0.006834 | 0.001482 | -0.006254 | 0.000101 | -0.007700 | 1.000000 | 0.018314 | 0.984733 | -0.011243 |
| totalintlcalls | -0.044919 | 0.013088 | 0.002776 | -0.000448 | -0.004427 | 0.002713 | 0.011587 | 0.000308 | 0.010567 | 0.001308 | 0.010663 | -0.019283 | -0.001553 | -0.018429 | 0.018314 | 1.000000 | 0.021163 | -0.018942 |
| totalintlcharge | 0.059816 | -0.000795 | 0.021917 | -0.003718 | 0.003519 | -0.017980 | 0.012926 | -0.018685 | -0.005727 | -0.005892 | -0.004732 | -0.005600 | 0.002806 | -0.007068 | 0.984733 | 0.021163 | 1.000000 | -0.011175 |
| numbercustomerservicecalls | 0.216391 | -0.001303 | -0.015707 | -0.005681 | -0.009512 | -0.004387 | -0.008329 | -0.000983 | -0.012021 | 0.005579 | -0.012234 | -0.006955 | -0.006382 | -0.008293 | -0.011243 | -0.018942 | -0.011175 | 1.000000 |
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
From the correlation matrix and the bivariate analysis performed, we detected a few highly correlated variable pairs that may be redundant for the problem. This was the case of the 'total minutes' and 'total charge' variables within each period, which corroborates our business understanding of the problem: the charge of a call is normally calculated directly from its duration in minutes.
With this in mind, removing the variables totaldayminutes, totaleveminutes, totalnightcharge and totalintlminutes could be a valid approach: within each pair, these were the ones less correlated with the target, and removing them would reduce the dimensionality of the dataset without drastically compromising the predictive performance of the future model.
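Dropping the redundant candidates is then a one-liner with drop(columns=...); a sketch on a toy frame (in the notebook, it would be applied to `data` itself):

```python
import pandas as pd

# Toy frame with only the minutes/charge pairs; `data` has all 18 columns.
df = pd.DataFrame({"totaldayminutes": [265.1], "totaldaycharge": [45.07],
                   "totaleveminutes": [197.4], "totalevecharge": [16.78],
                   "totalnightminutes": [244.7], "totalnightcharge": [11.01],
                   "totalintlminutes": [10.0], "totalintlcharge": [2.70]})

# One variable removed from each correlated pair, as discussed above.
redundant = ["totaldayminutes", "totaleveminutes",
             "totalnightcharge", "totalintlminutes"]
reduced = df.drop(columns=redundant)
print(sorted(reduced.columns))
# → ['totaldaycharge', 'totalevecharge', 'totalintlcharge', 'totalnightminutes']
```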
Boruta Method¶
# BorutaPy still references the NumPy aliases np.int, np.float and np.bool,
# which were removed in NumPy 1.24; restore them before running the selector.
np.int = np.int32
np.float = np.float64
np.bool = np.bool_
X = data.drop('churn', axis=1)
y = data['churn']
forest = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5, random_state=42)
boruta_selector = BorutaPy(forest, n_estimators='auto', verbose=2, random_state=42)
boruta_selector.fit(X.values, y.values)
Iteration: 1 / 100 Confirmed: 0 Tentative: 17 Rejected: 0
Iteration: 2 / 100 Confirmed: 0 Tentative: 17 Rejected: 0
Iteration: 3 / 100 Confirmed: 0 Tentative: 17 Rejected: 0
Iteration: 4 / 100 Confirmed: 0 Tentative: 17 Rejected: 0
Iteration: 5 / 100 Confirmed: 0 Tentative: 17 Rejected: 0
Iteration: 6 / 100 Confirmed: 0 Tentative: 17 Rejected: 0
Iteration: 7 / 100 Confirmed: 0 Tentative: 17 Rejected: 0
Iteration: 8 / 100 Confirmed: 13 Tentative: 0 Rejected: 4
BorutaPy finished running.
Iteration: 9 / 100 Confirmed: 13 Tentative: 0 Rejected: 4
BorutaPy(estimator=RandomForestClassifier(class_weight='balanced', max_depth=5,
n_estimators=116, n_jobs=-1,
random_state=RandomState(MT19937) at 0x2A81CE80840),
n_estimators='auto',
random_state=RandomState(MT19937) at 0x2A81CE80840, verbose=2)
selected_features = X.columns[boruta_selector.support_].tolist()
removed_features= X.columns[~boruta_selector.support_].tolist()
print("Selected Features:", selected_features)
print("Removed Features:", removed_features)
Selected Features: ['internationalplan', 'voicemailplan', 'numbervmailmessages', 'totaldayminutes', 'totaldaycharge', 'totaleveminutes', 'totalevecharge', 'totalnightminutes', 'totalnightcharge', 'totalintlminutes', 'totalintlcalls', 'totalintlcharge', 'numbercustomerservicecalls']
Removed Features: ['accountlength', 'totaldaycalls', 'totalevecalls', 'totalnightcalls']
forest.fit(X.values, y.values)
feature_importances = forest.feature_importances_
feature_names = X.columns
sorted_idx = np.argsort(feature_importances)[::-1]
plt.figure(figsize=(12, 6))
plt.bar(range(X.shape[1]), feature_importances[sorted_idx], align="center")
plt.xticks(range(X.shape[1]), np.array(feature_names)[sorted_idx], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Feature Importances after Boruta")
plt.show()
Recursive Feature Elimination¶
First, we trained our model with different values of "n_features_to_select" and evaluated its performance using cross-validation.
feature_range = range(9, len(X.columns) + 1, 2)
cv_scores = []
for n in feature_range:
    rfe_selector = RFE(estimator=forest, n_features_to_select=n, step=1)
    scores = cross_val_score(rfe_selector, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())
plt.plot(feature_range, cv_scores)
plt.xlabel('Number of Features Selected')
plt.ylabel('Cross-Validation Accuracy')
plt.title('RFE Performance')
plt.show()
Looking for an "elbow" point in the plot, where the performance gain starts to diminish, we chose 11 as the number of features to select.
n_features_to_select = min(11, X.shape[1])
rfe_selector = RFE(estimator=forest, n_features_to_select=n_features_to_select, step=1)
rfe_selector.fit(X, y)
final_selected_features = X.columns[rfe_selector.support_]
removed_features = X.columns[~rfe_selector.support_]
feature_importances = rfe_selector.estimator_.feature_importances_
sorted_idx = np.argsort(feature_importances)[::-1]
plt.figure(figsize=(12, 6))
plt.bar(range(len(final_selected_features)), feature_importances[sorted_idx], align="center")
plt.xticks(range(len(final_selected_features)), np.array(final_selected_features)[sorted_idx], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.title("Feature Importances after RFE")
plt.show()
print("Final Selected Features:", final_selected_features)
print("Removed Features:", removed_features)
Final Selected Features: Index(['internationalplan', 'voicemailplan', 'numbervmailmessages',
'totaldayminutes', 'totaldaycharge', 'totaleveminutes',
'totalevecharge', 'totalintlminutes', 'totalintlcalls',
'totalintlcharge', 'numbercustomerservicecalls'],
dtype='object')
Removed Features: Index(['accountlength', 'totaldaycalls', 'totalevecalls', 'totalnightminutes',
'totalnightcalls', 'totalnightcharge'],
dtype='object')
It is interesting to observe that Boruta and Recursive Feature Elimination (RFE) independently flagged overlapping variables for removal: the 4 variables excluded by Boruta are all among the 6 removed by RFE. This suggests that these particular features indeed contribute little to the predictive performance of the model.
Building a new dataset without these 4 variables therefore seems a reasonable approach.
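The agreement between the two methods can be made explicit by intersecting their removal lists (taken verbatim from the outputs above):

```python
# Removal lists as reported by Boruta and RFE above.
boruta_removed = {"accountlength", "totaldaycalls", "totalevecalls",
                  "totalnightcalls"}
rfe_removed = {"accountlength", "totaldaycalls", "totalevecalls",
               "totalnightminutes", "totalnightcalls", "totalnightcharge"}

# Features both methods agree on removing.
agreed = sorted(boruta_removed & rfe_removed)
print(agreed)
# → ['accountlength', 'totaldaycalls', 'totalevecalls', 'totalnightcalls']
# The reduced dataset would then be: data.drop(columns=agreed)
```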
3.5 - Construct Data: Discretising¶
Discretizing data converts continuous data into discrete or categorical data, which makes data more manageable for machine learning algorithms. It reduces data dimensionality, improves data quality and enhances model performance, by making the data more compatible with the algorithms and reducing overfitting.
accountlength_array = data['accountlength'].to_numpy()
pd.cut(accountlength_array, 10)
[(122.0, 146.2], (97.8, 122.0], (122.0, 146.2], (73.6, 97.8], (73.6, 97.8], ..., (49.4, 73.6], (146.2, 170.4], (49.4, 73.6], (97.8, 122.0], (73.6, 97.8]] Length: 5000 Categories (10, interval[float64, right]): [(0.758, 25.2] < (25.2, 49.4] < (49.4, 73.6] < (73.6, 97.8] ... (146.2, 170.4] < (170.4, 194.6] < (194.6, 218.8] < (218.8, 243.0]]
totaldayminutes_array = data['totaldayminutes'].to_numpy()
pd.cut(totaldayminutes_array, 10)
[(246.05, 281.2], (140.6, 175.75], (210.9, 246.05], (281.2, 316.35], (140.6, 175.75], ..., (210.9, 246.05], (175.75, 210.9], (105.45, 140.6], (175.75, 210.9], (105.45, 140.6]] Length: 5000 Categories (10, interval[float64, right]): [(-0.352, 35.15] < (35.15, 70.3] < (70.3, 105.45] < (105.45, 140.6] ... (210.9, 246.05] < (246.05, 281.2] < (281.2, 316.35] < (316.35, 351.5]]
totaldaycalls_array = data['totaldaycalls'].to_numpy()
pd.cut(totaldaycalls_array, 10)
[(99.0, 115.5], (115.5, 132.0], (99.0, 115.5], (66.0, 82.5], (99.0, 115.5], ..., (115.5, 132.0], (82.5, 99.0], (82.5, 99.0], (66.0, 82.5], (99.0, 115.5]] Length: 5000 Categories (10, interval[float64, right]): [(-0.165, 16.5] < (16.5, 33.0] < (33.0, 49.5] < (49.5, 66.0] ... (99.0, 115.5] < (115.5, 132.0] < (132.0, 148.5] < (148.5, 165.0]]
totaldaycharge_array = data['totaldaycharge'].to_numpy()
pd.cut(totaldaycharge_array, 10)
[(41.832, 47.808], (23.904, 29.88], (35.856, 41.832], (47.808, 53.784], (23.904, 29.88], ..., (35.856, 41.832], (29.88, 35.856], (17.928, 23.904], (29.88, 35.856], (17.928, 23.904]] Length: 5000 Categories (10, interval[float64, right]): [(-0.0598, 5.976] < (5.976, 11.952] < (11.952, 17.928] < (17.928, 23.904] ... (35.856, 41.832] < (41.832, 47.808] < (47.808, 53.784] < (53.784, 59.76]]
totaleveminutes_array = data['totaleveminutes'].to_numpy()
pd.cut(totaleveminutes_array, 10)
[(181.85, 218.22], (181.85, 218.22], (109.11, 145.48], (36.37, 72.74], (145.48, 181.85], ..., (218.22, 254.59], (254.59, 290.96], (145.48, 181.85], (145.48, 181.85], (254.59, 290.96]] Length: 5000 Categories (10, interval[float64, right]): [(-0.364, 36.37] < (36.37, 72.74] < (72.74, 109.11] < (109.11, 145.48] ... (218.22, 254.59] < (254.59, 290.96] < (290.96, 327.33] < (327.33, 363.7]]
totalevecalls_array = data['totalevecalls'].to_numpy()
pd.cut(totalevecalls_array, 10)
[(85.0, 102.0], (102.0, 119.0], (102.0, 119.0], (85.0, 102.0], (119.0, 136.0], ..., (119.0, 136.0], (68.0, 85.0], (119.0, 136.0], (85.0, 102.0], (102.0, 119.0]] Length: 5000 Categories (10, interval[float64, right]): [(-0.17, 17.0] < (17.0, 34.0] < (34.0, 51.0] < (51.0, 68.0] ... (102.0, 119.0] < (119.0, 136.0] < (136.0, 153.0] < (153.0, 170.0]]
totalevecharge_array = data['totalevecharge'].to_numpy()
pd.cut(totalevecharge_array, 10)
[(15.455, 18.546], (15.455, 18.546], (9.273, 12.364], (3.091, 6.182], (12.364, 15.455], ..., (18.546, 21.637], (21.637, 24.728], (12.364, 15.455], (12.364, 15.455], (21.637, 24.728]] Length: 5000 Categories (10, interval[float64, right]): [(-0.0309, 3.091] < (3.091, 6.182] < (6.182, 9.273] < (9.273, 12.364] ... (18.546, 21.637] < (21.637, 24.728] < (24.728, 27.819] < (27.819, 30.91]]
totalnightminutes_array = data['totalnightminutes'].to_numpy()
pd.cut(totalnightminutes_array, 10)
[(237.0, 276.5], (237.0, 276.5], (158.0, 197.5], (158.0, 197.5], (158.0, 197.5], ..., (276.5, 316.0], (197.5, 237.0], (197.5, 237.0], (197.5, 237.0], (118.5, 158.0]] Length: 5000 Categories (10, interval[float64, right]): [(-0.395, 39.5] < (39.5, 79.0] < (79.0, 118.5] < (118.5, 158.0] ... (237.0, 276.5] < (276.5, 316.0] < (316.0, 355.5] < (355.5, 395.0]]
totalnightcalls_array = data['totalnightcalls'].to_numpy()
pd.cut(totalnightcalls_array, 10)
[(87.5, 105.0], (87.5, 105.0], (87.5, 105.0], (87.5, 105.0], (105.0, 122.5], ..., (105.0, 122.5], (105.0, 122.5], (87.5, 105.0], (87.5, 105.0], (87.5, 105.0]] Length: 5000 Categories (10, interval[float64, right]): [(-0.175, 17.5] < (17.5, 35.0] < (35.0, 52.5] < (52.5, 70.0] ... (105.0, 122.5] < (122.5, 140.0] < (140.0, 157.5] < (157.5, 175.0]]
totalnightcharge_array = data['totalnightcharge'].to_numpy()
pd.cut(totalnightcharge_array, 10)
[(10.662, 12.439], (10.662, 12.439], (7.108, 8.885], (7.108, 8.885], (7.108, 8.885], ..., (12.439, 14.216], (8.885, 10.662], (8.885, 10.662], (8.885, 10.662], (5.331, 7.108]] Length: 5000 Categories (10, interval[float64, right]): [(-0.0178, 1.777] < (1.777, 3.554] < (3.554, 5.331] < (5.331, 7.108] ... (10.662, 12.439] < (12.439, 14.216] < (14.216, 15.993] < (15.993, 17.77]]
totalintlminutes_array = data['totalintlminutes'].to_numpy()
pd.cut(totalintlminutes_array, 10)
[(8.0, 10.0], (12.0, 14.0], (12.0, 14.0], (6.0, 8.0], (10.0, 12.0], ..., (8.0, 10.0], (14.0, 16.0], (12.0, 14.0], (8.0, 10.0], (8.0, 10.0]] Length: 5000 Categories (10, interval[float64, right]): [(-0.02, 2.0] < (2.0, 4.0] < (4.0, 6.0] < (6.0, 8.0] ... (12.0, 14.0] < (14.0, 16.0] < (16.0, 18.0] < (18.0, 20.0]]
totalintlcharge_array = data['totalintlcharge'].to_numpy()
pd.cut(totalintlcharge_array, 10)
[(2.16, 2.7], (3.24, 3.78], (3.24, 3.78], (1.62, 2.16], (2.7, 3.24], ..., (2.16, 2.7], (3.78, 4.32], (3.24, 3.78], (2.16, 2.7], (2.16, 2.7]] Length: 5000 Categories (10, interval[float64, right]): [(-0.0054, 0.54] < (0.54, 1.08] < (1.08, 1.62] < (1.62, 2.16] ... (3.24, 3.78] < (3.78, 4.32] < (4.32, 4.86] < (4.86, 5.4]]
One of the ways to select the number of bins into which the data will be discretized is through visualization. By plotting the data distribution in the previous step, we observe that the number of bins that might better fit our data is 10.
For continuous variables with a somewhat normal distribution, we decided to do equal-width binning. This divides the range of the data into a fixed number of bins of equal width. This method can be problematic if the data is not normally distributed.
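The per-column `pd.cut` calls above can be condensed into a single loop. A self-contained sketch on synthetic data (column names borrowed from the dataset, values generated only for illustration):

```python
import numpy as np
import pandas as pd

# Sketch: equal-width binning applied to several (assumed roughly normal)
# columns in one loop, storing the binned version in a new "<col>_bin" column.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'totaldayminutes': rng.normal(180, 50, 200),
                     'totalevecharge': rng.normal(17, 4, 200)})
for col in ['totaldayminutes', 'totalevecharge']:
    demo[col + '_bin'] = pd.cut(demo[col], bins=10)

# pd.cut always creates exactly the requested number of interval categories.
print(demo['totaldayminutes_bin'].cat.categories.size)  # 10
```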
pd.qcut(data['numbervmailmessages'], 10, duplicates='drop')
0 (24.0, 32.0]
1 (24.0, 32.0]
2 (-0.001, 24.0]
3 (-0.001, 24.0]
4 (-0.001, 24.0]
...
4995 (32.0, 52.0]
4996 (-0.001, 24.0]
4997 (-0.001, 24.0]
4998 (-0.001, 24.0]
4999 (32.0, 52.0]
Name: numbervmailmessages, Length: 5000, dtype: category
Categories (3, interval[float64, right]): [(-0.001, 24.0] < (24.0, 32.0] < (32.0, 52.0]]
pd.qcut(data['numbervmailmessages'], 10, duplicates='drop').value_counts()
numbervmailmessages
(-0.001, 24.0]    4056
(24.0, 32.0]       499
(32.0, 52.0]       445
Name: count, dtype: int64
pd.qcut(data['totalintlcalls'], 10, duplicates='drop')
0 (2.0, 3.0]
1 (2.0, 3.0]
2 (4.0, 5.0]
3 (6.0, 8.0]
4 (2.0, 3.0]
...
4995 (4.0, 5.0]
4996 (-0.001, 2.0]
4997 (3.0, 4.0]
4998 (5.0, 6.0]
4999 (8.0, 20.0]
Name: totalintlcalls, Length: 5000, dtype: category
Categories (7, interval[float64, right]): [(-0.001, 2.0] < (2.0, 3.0] < (3.0, 4.0] < (4.0, 5.0] < (5.0, 6.0] < (6.0, 8.0] < (8.0, 20.0]]
pd.qcut(data['totalintlcalls'], 10, duplicates='drop').value_counts()
totalintlcalls
(-0.001, 2.0]    1018
(2.0, 3.0]        993
(3.0, 4.0]        964
(4.0, 5.0]        715
(5.0, 6.0]        496
(6.0, 8.0]        480
(8.0, 20.0]       334
Name: count, dtype: int64
pd.qcut(data['numbercustomerservicecalls'], 10, duplicates='drop')
0 (-0.001, 1.0]
1 (-0.001, 1.0]
2 (-0.001, 1.0]
3 (1.0, 2.0]
4 (2.0, 3.0]
...
4995 (1.0, 2.0]
4996 (2.0, 3.0]
4997 (-0.001, 1.0]
4998 (-0.001, 1.0]
4999 (-0.001, 1.0]
Name: numbercustomerservicecalls, Length: 5000, dtype: category
Categories (4, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] < (2.0, 3.0] < (3.0, 9.0]]
pd.qcut(data['numbercustomerservicecalls'], 10, duplicates='drop').value_counts()
numbercustomerservicecalls
(-0.001, 1.0]    2788
(1.0, 2.0]       1152
(2.0, 3.0]        668
(3.0, 9.0]        392
Name: count, dtype: int64
For continuous variables with non-normal distributions, we chose equal-frequency binning. This divides the data into a fixed number of bins of equal frequency. This method is more robust to non-normal distributions but can be more difficult to implement. The goal is to create bins with approximately the same number of data points.
The resulting number of categories may be smaller than the specified number of bins, which happened with our data. Because the data is not uniformly distributed, several quantile boundaries can coincide when many observations share the same value (for example, the large number of zeros in numbervmailmessages). With duplicates='drop', these coinciding bin edges are merged, so fewer bins are produced than requested. The precision used when computing the bin edges can also affect the final number of categories.
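A toy illustration of this behaviour, using a synthetic series with many tied values (similar in spirit to the zeros in numbervmailmessages):

```python
import pandas as pd

# Toy example of why qcut(..., duplicates='drop') can return fewer bins than
# requested: with many tied values, several quantile edges coincide and merge.
s = pd.Series([0] * 80 + list(range(1, 21)))   # 80% zeros
binned = pd.qcut(s, 10, duplicates='drop')
print(binned.cat.categories.size)              # fewer than the 10 bins requested
```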
4. Modelling¶
4.1 Data shuffle and split¶
First we shuffled the data before splitting it into training and test sets, in order to make the division unbiased. We allocated 80% for training, instead of 70%, because having more data for training can lead to more accurate models; on the other hand, allocating 20% instead of 30% for testing may provide a less robust evaluation.
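A variant worth noting: `train_test_split` can also preserve the class proportions in both splits via its `stratify` argument. A self-contained sketch on synthetic labels (our notebook uses a plain shuffled split instead):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with ~14% positives, mimicking an imbalanced churn column.
rng = np.random.default_rng(10)
y = (rng.random(5000) < 0.14).astype(int)
X = rng.normal(size=(5000, 3))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=10, stratify=y)
# Stratification keeps the positive rate (almost) identical in both splits.
print(y_tr.mean(), y_te.mean())
```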
data_shuffled = data.sample(frac = 1, random_state=10)
data_shuffled
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 245 | 0.0 | 22.0 | 0.0 | 0.0 | 0.0 | 110.3 | 107.0 | 18.75 | 166.5 | 93.0 | 14.15 | 202.3 | 96.0 | 9.10 | 9.5 | 5.0 | 2.57 | 0.0 |
| 4493 | 0.0 | 83.0 | 0.0 | 1.0 | 37.0 | 133.1 | 117.0 | 22.63 | 171.4 | 89.0 | 14.57 | 218.1 | 92.0 | 9.81 | 10.9 | 4.0 | 2.94 | 3.0 |
| 4583 | 0.0 | 104.0 | 0.0 | 0.0 | 0.0 | 246.4 | 108.0 | 41.89 | 205.5 | 102.0 | 17.47 | 208.1 | 125.0 | 9.36 | 14.1 | 8.0 | 3.81 | 2.0 |
| 2242 | 0.0 | 192.0 | 0.0 | 0.0 | 0.0 | 185.0 | 88.0 | 31.45 | 224.9 | 98.0 | 19.12 | 212.4 | 105.0 | 9.56 | 11.4 | 3.0 | 3.08 | 2.0 |
| 3407 | 0.0 | 93.0 | 0.0 | 0.0 | 0.0 | 120.6 | 104.0 | 20.50 | 205.5 | 95.0 | 17.47 | 182.5 | 107.0 | 8.21 | 9.6 | 6.0 | 2.59 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1180 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 207.8 | 92.0 | 35.33 | 195.7 | 110.0 | 16.63 | 184.8 | 124.0 | 8.32 | 13.1 | 4.0 | 3.54 | 0.0 |
| 3441 | 0.0 | 56.0 | 0.0 | 0.0 | 0.0 | 254.8 | 83.0 | 43.32 | 194.0 | 98.0 | 16.49 | 186.2 | 118.0 | 8.38 | 13.5 | 2.0 | 3.65 | 2.2 |
| 1344 | 1.0 | 111.0 | 0.0 | 0.0 | 0.0 | 284.4 | 89.0 | 48.35 | 211.5 | 113.0 | 13.35 | 242.8 | 91.0 | 10.93 | 8.4 | 8.0 | 2.27 | 0.0 |
| 4623 | 0.0 | 74.0 | 0.0 | 0.0 | 0.0 | 207.1 | 79.0 | 35.21 | 182.0 | 100.0 | 15.47 | 233.7 | 73.0 | 10.52 | 7.4 | 8.0 | 2.00 | 1.0 |
| 1289 | 0.0 | 129.0 | 0.0 | 0.0 | 0.0 | 98.0 | 99.0 | 16.66 | 240.7 | 62.0 | 20.46 | 254.8 | 123.0 | 11.47 | 10.5 | 4.0 | 2.84 | 0.0 |
5000 rows × 18 columns
train_data, test_data = train_test_split(data_shuffled, test_size=0.2, random_state=10)
train_data
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3439 | 1.0 | 110.0 | 0.0 | 0.0 | 0.0 | 220.9 | 100.0 | 37.550 | 279.6 | 76.0 | 23.77 | 239.8 | 107.0 | 10.79 | 13.5 | 9.0 | 3.650 | 1.0 |
| 4509 | 0.0 | 128.0 | 0.0 | 1.0 | 40.0 | 272.2 | 112.0 | 46.270 | 239.9 | 111.0 | 20.39 | 257.3 | 124.0 | 11.58 | 11.6 | 2.0 | 3.130 | 1.0 |
| 945 | 0.0 | 122.0 | 0.0 | 0.0 | 0.0 | 180.0 | 88.0 | 30.600 | 145.0 | 95.4 | 12.33 | 233.7 | 120.0 | 10.52 | 11.5 | 6.0 | 3.110 | 2.0 |
| 537 | 0.0 | 190.0 | 0.0 | 1.0 | 22.0 | 166.5 | 93.0 | 28.310 | 183.0 | 92.0 | 15.56 | 121.0 | 102.0 | 5.44 | 8.5 | 3.0 | 2.300 | 0.0 |
| 4887 | 0.0 | 116.0 | 0.0 | 0.0 | 0.0 | 152.4 | 101.0 | 25.910 | 191.0 | 90.0 | 16.24 | 143.0 | 96.0 | 6.44 | 11.1 | 9.0 | 3.000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3619 | 0.0 | 81.0 | 0.0 | 1.0 | 36.0 | 177.6 | 106.0 | 28.020 | 210.2 | 138.0 | 17.87 | 173.6 | 78.0 | 7.81 | 4.3 | 2.0 | 1.160 | 2.0 |
| 456 | 0.0 | 60.0 | 0.0 | 0.0 | 0.0 | 98.2 | 88.0 | 16.690 | 180.5 | 69.0 | 15.34 | 223.6 | 69.0 | 10.06 | 9.3 | 2.0 | 2.510 | 2.0 |
| 3686 | 1.0 | 64.0 | 0.0 | 0.0 | 0.0 | 297.8 | 88.0 | 49.454 | 196.0 | 89.0 | 16.66 | 213.7 | 107.0 | 9.62 | 12.0 | 2.0 | 2.482 | 1.0 |
| 1677 | 0.0 | 123.0 | 0.0 | 0.0 | 0.0 | 163.1 | 119.0 | 27.730 | 249.4 | 51.0 | 21.20 | 168.2 | 77.0 | 7.57 | 9.0 | 10.0 | 2.430 | 1.0 |
| 4996 | 1.0 | 152.0 | 0.0 | 0.0 | 0.0 | 184.2 | 90.0 | 31.310 | 256.8 | 73.0 | 21.83 | 213.6 | 113.0 | 9.61 | 14.7 | 2.0 | 3.970 | 3.0 |
4000 rows × 18 columns
test_data
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 845 | 0.0 | 144.0 | 0.0 | 1.0 | 51.0 | 283.9 | 98.0 | 48.26 | 192.0 | 109.0 | 16.32 | 196.3 | 85.0 | 8.83 | 10.00 | 4.0 | 2.70 | 1.0 |
| 4675 | 0.0 | 52.0 | 0.0 | 0.0 | 0.0 | 165.9 | 122.0 | 28.20 | 212.2 | 107.0 | 18.04 | 202.1 | 94.0 | 9.09 | 8.30 | 3.0 | 2.24 | 1.0 |
| 1672 | 0.0 | 95.0 | 0.0 | 0.0 | 0.0 | 203.4 | 96.0 | 34.58 | 168.6 | 61.0 | 14.33 | 173.0 | 105.0 | 7.79 | 13.70 | 3.0 | 3.70 | 2.0 |
| 3424 | 0.0 | 82.0 | 0.0 | 0.0 | 0.0 | 183.6 | 111.0 | 31.21 | 303.5 | 88.0 | 25.80 | 189.8 | 115.0 | 8.54 | 10.30 | 6.0 | 2.78 | 0.0 |
| 4756 | 0.0 | 84.0 | 0.0 | 0.0 | 0.0 | 172.3 | 92.0 | 29.29 | 224.2 | 96.0 | 19.06 | 223.4 | 100.0 | 10.05 | 9.70 | 1.0 | 2.62 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2646 | 0.0 | 101.0 | 0.0 | 2.0 | 0.0 | 232.7 | 114.0 | 39.56 | 186.4 | 123.0 | 15.84 | 153.3 | 122.0 | 6.90 | 11.50 | 6.0 | 3.11 | 5.0 |
| 1637 | 0.0 | 74.0 | 0.0 | 0.0 | 0.0 | 225.2 | 93.0 | 38.28 | 215.1 | 120.0 | 18.28 | 241.8 | 95.0 | 10.88 | 9.74 | 2.0 | 2.46 | 2.0 |
| 4045 | 0.0 | 122.0 | 0.0 | 1.0 | 21.0 | 196.0 | 104.0 | 33.32 | 194.7 | 73.0 | 16.55 | 150.8 | 92.0 | 6.79 | 9.60 | 4.0 | 2.59 | 0.0 |
| 1093 | 0.0 | 210.0 | 0.0 | 0.0 | 0.0 | 104.6 | 121.0 | 17.78 | 149.5 | 71.0 | 12.71 | 255.1 | 67.0 | 11.48 | 6.50 | 8.0 | 1.76 | 2.0 |
| 570 | 0.0 | 51.0 | 0.0 | 0.0 | 0.0 | 181.5 | 108.0 | 30.86 | 196.9 | 87.0 | 16.74 | 187.2 | 119.0 | 8.42 | 10.30 | 2.0 | 2.78 | 1.0 |
1000 rows × 18 columns
4.2 - Sampling¶
Modeling with an imbalanced dataset, where one class is significantly more prevalent than the other, can have several implications: the model might become biased toward the majority class, might have poor generalization to the minority class and accuracy can become a misleading metric.
Due to the highly unbalanced nature of our data set, we performed two sampling techniques to improve the performance of the models. We used the Synthetic Minority Over-sampling Technique (SMOTE) for over-sampling the minority class (churn = 1) and the Random Under-Sampling technique to under-sample the majority class (churn = 0).
As a result, we obtained a balanced dataset where each class has exactly the same number of instances.
X_train = train_data.drop('churn', axis=1)
y_train = np.array(train_data['churn'])
X_test = test_data.drop('churn', axis=1)
y_test = np.array(test_data['churn'])
pipeline = make_pipeline(SMOTE(sampling_strategy=0.5, random_state=10), RandomUnderSampler(sampling_strategy=1.0, random_state=10))
X_train_balanced, y_train_balanced = pipeline.fit_resample(X_train,y_train)
X_train_balanced
| accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3524 | 22.000000 | 0.000000 | 0.0 | 0.000000 | 222.900000 | 97.000000 | 37.890000 | 172.600000 | 91.000000 | 14.670000 | 187.600000 | 99.000000 | 8.440000 | 10.100000 | 3.000000 | 2.730000 | 1.000000 |
| 2478 | 172.000000 | 0.000000 | 0.0 | 0.000000 | 211.700000 | 100.000000 | 35.990000 | 198.700000 | 101.000000 | 16.890000 | 301.700000 | 136.000000 | 13.580000 | 6.500000 | 9.000000 | 1.760000 | 1.000000 |
| 717 | 96.000000 | 0.000000 | 0.0 | 0.000000 | 170.500000 | 86.000000 | 28.990000 | 277.500000 | 88.000000 | 23.590000 | 162.500000 | 117.000000 | 7.310000 | 12.200000 | 6.000000 | 3.290000 | 1.000000 |
| 3448 | 86.000000 | 0.000000 | 0.0 | 0.000000 | 178.300000 | 94.000000 | 30.310000 | 264.400000 | 74.000000 | 22.470000 | 284.600000 | 109.000000 | 12.810000 | 15.000000 | 2.000000 | 4.050000 | 0.000000 |
| 2234 | 127.000000 | 0.000000 | 0.0 | 0.000000 | 182.300000 | 124.000000 | 30.990000 | 169.900000 | 110.000000 | 14.440000 | 184.000000 | 116.000000 | 8.280000 | 9.300000 | 3.000000 | 2.510000 | 1.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5152 | 144.564041 | 0.503050 | 1.0 | 36.012198 | 240.861576 | 94.435959 | 40.943474 | 220.167675 | 125.957306 | 18.717255 | 191.531715 | 72.478653 | 8.616433 | 7.055794 | 3.490851 | 1.906555 | 1.496950 |
| 5153 | 104.668934 | 0.000000 | 0.0 | 0.000000 | 162.843956 | 81.387636 | 27.681929 | 190.288161 | 79.775271 | 16.176003 | 143.768254 | 78.262442 | 6.473756 | 9.408748 | 3.000000 | 2.542624 | 5.912519 |
| 5154 | 78.135668 | 0.946957 | 0.0 | 0.000000 | 220.712955 | 77.007821 | 37.523043 | 276.288609 | 121.310439 | 23.480983 | 239.669738 | 112.371302 | 10.787426 | 11.230262 | 4.893914 | 3.033852 | 1.053043 |
| 5155 | 82.899100 | 0.000000 | 0.0 | 0.000000 | 288.970929 | 100.911090 | 49.127982 | 233.689910 | 97.341659 | 19.860889 | 223.773327 | 147.696303 | 10.072148 | 12.779720 | 2.189810 | 3.451638 | 1.924076 |
| 5156 | 87.024894 | 0.000000 | 0.0 | 0.000000 | 93.093592 | 102.256886 | 15.826372 | 212.238268 | 116.154662 | 18.041009 | 225.963824 | 120.281779 | 10.171782 | 8.556356 | 3.359110 | 2.310498 | 3.051112 |
3438 rows × 17 columns
a) Modelling with the balanced dataset¶
a.1) Nearest neighbor¶
When training a kNN classifier, it is necessary to normalize the features because kNN measures the distance between points, and we do not want some variables to be weighted more than others. The Euclidean Distance is usually used, which is the square root of the sum of the squared differences between two points.
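A small numeric sketch of why scaling matters for the Euclidean distance (feature names from the dataset, values and standard deviations chosen only for illustration):

```python
import numpy as np

# Two customers: close in totaldayminutes, three calls apart in
# numbercustomerservicecalls. Unscaled, the minutes dimension dominates.
a = np.array([200.0, 1.0])  # [totaldayminutes, numbercustomerservicecalls]
b = np.array([210.0, 4.0])
raw_dist = np.linalg.norm(a - b)  # sqrt(10**2 + 3**2) ≈ 10.44

# Dividing each difference by an illustrative per-feature std dev
# (50 minutes, 1.3 calls) puts both dimensions on a comparable scale.
scaled_dist = np.linalg.norm((a - b) / np.array([50.0, 1.3]))  # ≈ 2.32
print(raw_dist, scaled_dist)
```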
We normalized the data after splitting it into training and test sets: fitting the scaler on all the data at once would leak information about the test set into the model.
scaler = StandardScaler()
X_train_balanced_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)
To start training the model we used a fixed value of k=3, which will later be optimized.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_balanced_scaled, y_train_balanced)
KNeighborsClassifier(n_neighbors=3)
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.81
We may be able to get a better accuracy value by optimizing our value of k. In order to do this, we loop through many different k values.
k_values = [i for i in range (1,30)]
scores = []
# X_train_balanced_scaled and X_test_scaled were already produced above by the
# same fitted scaler, so the data must not be re-fitted or re-scaled here
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_balanced_scaled, y_train_balanced)
y_pred = knn.predict(X_test_scaled)
score = accuracy_score(y_test, y_pred)
scores.append(np.mean(score))
sns.lineplot(x = k_values, y = scores, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")
Text(0, 0.5, 'Accuracy Score')
We can see from our chart that k = 9 and k = 13 appear to have the highest accuracy score, so we chose the lower one, k = 9.
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_balanced_scaled, y_train_balanced)
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.801
These are the accuracy, precision, and recall scores for our model, using the best k score of 9:
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train_balanced_scaled, y_train_balanced)
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.801
Precision: 0.40217391304347827
Recall: 0.7655172413793103
F1 Score: 0.5273159144893111
Accuracy is the proportion of all the predictions that are correct, precision is the proportion of predicted positive labels that are actually positive, and recall is the proportion of actual positive labels that are correctly identified as positive.
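These definitions can be checked against the kNN scores above. From accuracy 0.801, precision ≈ 0.402 and recall ≈ 0.766 over 145 positives and 855 negatives, the implied confusion-matrix counts work out to TP=111, FP=165, FN=34, TN=690:

```python
# Metric definitions applied to the confusion-matrix counts implied by the
# kNN scores above (1000 test customers in total).
TP, FP, FN, TN = 111, 165, 34, 690
accuracy  = (TP + TN) / (TP + FP + FN + TN)                # 801/1000 = 0.801
precision = TP / (TP + FP)                                 # 111/276 ≈ 0.402
recall    = TP / (TP + FN)                                 # 111/145 ≈ 0.766
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.527
print(accuracy, precision, recall, f1)
```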
As we can see, the accuracy value is higher using k=9 than it was when we initially used k=3. Although the accuracy of this model is considerably high, precision and recall are not: precision is under 50%, which is reflected in a low F1 score.
print("Classification Report for X_train_balanced data:")
print(classification_report(y_test, y_pred, zero_division=0))
correct_predictions = (y_pred == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
Classification Report for X_train_balanced data:
precision recall f1-score support
0.0 0.95 0.81 0.87 855
1.0 0.40 0.77 0.53 145
accuracy 0.80 1000
macro avg 0.68 0.79 0.70 1000
weighted avg 0.87 0.80 0.82 1000
Number of correct predictions: 801
KNN_conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 4))
sns.heatmap(KNN_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of KNN')
plt.show()
a.2) Bayesian classifier¶
Bayesian classifiers are a class of probabilistic classifiers based on Bayes' theorem. In this work we used the MixedNB method, that handles both continuous and categorical variables.
First we scaled the numerical features using StandardScaler (normally distributed features) and MinMaxScaler (features that don’t follow a normal distribution). This preprocessing step is essential for scaling numerical features to have comparable ranges and magnitudes, enhancing the performance of this machine learning algorithm.
X_train_balanced = X_train_balanced.loc[:, X_train_balanced.columns != 'churn']
y_train_balanced = y_train_balanced.astype(int)
scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
'totalnightcharge', 'totalintlminutes', 'totalintlcharge']
scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']
X_numeric_train_balanced = X_train_balanced[numeric_features]
X_numeric_train_balanced_scaled = scaler_standard.fit_transform(X_numeric_train_balanced)
X_train_balanced_scaled = X_train_balanced.copy()
X_train_balanced_scaled[numeric_features] = X_numeric_train_balanced_scaled
X_numeric_minmax = X_train_balanced_scaled[numeric_features_minmax]
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)
X_train_balanced_scaled[numeric_features_minmax] = X_numeric_minmax_scaled
model_NB_b = MixedNB()
model_NB_b.fit(X_train_balanced_scaled, y_train_balanced)
preds = model_NB_b.predict(X_train_balanced_scaled)
accuracy = accuracy_score(y_train_balanced, preds)
print(f"Accuracy on X_train_balanced data: {accuracy:.2f}")
print("Classification Report for X_train_balanced data:")
print(classification_report(y_train_balanced, preds, zero_division=0))
correct_predictions = (preds == y_train_balanced)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
conf_matrix = confusion_matrix(y_train_balanced, preds)
plt.figure(figsize=(4, 4))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NB')
plt.show()
Accuracy on X_train_balanced data: 0.78
Classification Report for X_train_balanced data:
precision recall f1-score support
0 0.81 0.74 0.78 1719
1 0.76 0.82 0.79 1719
accuracy 0.78 3438
macro avg 0.79 0.78 0.78 3438
weighted avg 0.79 0.78 0.78 3438
Number of correct predictions: 2696
The accuracy achieved on the balanced training data indicates reasonable performance (78%). The precision is fairly balanced across classes (81% versus 76% for not churned and churned, respectively), and the recall (74% and 82% for not churned and churned, respectively) and f1-scores (78% and 79%, respectively) likewise indicate a balanced performance across the two classes.
a.3) Decision Tree¶
First we decided to develop an unpruned decision tree, visualize the model and examine its accuracy, recall, precision and f1 scores.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train_balanced,y_train_balanced)
DecisionTreeClassifier(random_state=0)
plt.subplots(nrows = 1,ncols = 1,figsize = (12,12), dpi=200)
tree.plot_tree(model, fontsize=5, filled=True)
plt.title('Unpruned Decision Tree')
plt.show()
preds = model.predict(X_train_balanced)
# fraction of training predictions that are correct
(vals, ans) = np.unique(preds == y_train_balanced, return_counts=True)
ans[vals] / sum(ans)
array([1.])
def scores(model):
preds = model.predict(X_test)
print("Classification Report for test data:")
print(classification_report(y_test, preds, zero_division=0))
scores(model)
Classification Report for test data:
precision recall f1-score support
0.0 0.96 0.90 0.93 855
1.0 0.58 0.79 0.67 145
accuracy 0.89 1000
macro avg 0.77 0.84 0.80 1000
weighted avg 0.91 0.89 0.89 1000
Despite having 100% accuracy on the training data, the model has a lower accuracy value on the test data, which is a clear indication of overfitting.
To address this issue, we consider pruning the decision tree by varying some of the hyperparameters:
max_depth=[]
accuracies = []
accuracies_train = []
for n in range(1,20,2):
model = DecisionTreeClassifier(random_state=0,max_depth=n)
model.fit(X_train_balanced,y_train_balanced)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
y_pred_train = model.predict(X_train_balanced)
accuracy_train = accuracy_score(y_train_balanced, y_pred_train)
accuracies_train.append(accuracy_train)
max_depth.append(n)
accuracies.append(accuracy)
print(f"With Max Depth of {n} the accuracy on the train data is {accuracy_train} and the accuracy on the test data is {accuracy}")
With Max Depth of 1 the accuracy on the train data is 0.673065735892961 and the accuracy on the test data is 0.832
With Max Depth of 3 the accuracy on the train data is 0.8528214077952297 and the accuracy on the test data is 0.825
With Max Depth of 5 the accuracy on the train data is 0.9162303664921466 and the accuracy on the test data is 0.904
With Max Depth of 7 the accuracy on the train data is 0.9432809773123909 and the accuracy on the test data is 0.919
With Max Depth of 9 the accuracy on the train data is 0.9616055846422339 and the accuracy on the test data is 0.914
With Max Depth of 11 the accuracy on the train data is 0.9720767888307156 and the accuracy on the test data is 0.907
With Max Depth of 13 the accuracy on the train data is 0.9872018615474113 and the accuracy on the test data is 0.901
With Max Depth of 15 the accuracy on the train data is 0.997673065735893 and the accuracy on the test data is 0.894
With Max Depth of 17 the accuracy on the train data is 1.0 and the accuracy on the test data is 0.886
With Max Depth of 19 the accuracy on the train data is 1.0 and the accuracy on the test data is 0.886
plt.figure(figsize=(10, 6))
plt.plot(max_depth, accuracies_train, label='Train Accuracy', marker='o')
plt.plot(max_depth, accuracies, label='Test Accuracy', marker='o')
plt.title('Decision Tree Accuracy vs. Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
We've conducted a valuable experiment to analyze the effect of varying the max depth of the decision tree on both the training and test accuracy. Our findings show that:
- As we increase the max depth of the decision tree, the model becomes more complex and is able to capture more intricate details of the training data.
- The graph suggests that the accuracy on the test set peaks at a max depth of 7. Beyond a max depth of 7, the model starts to overfit the training data and loses its ability to generalize to new, unseen data, resulting in a decrease in accuracy on the test set.
model = DecisionTreeClassifier(random_state=0,max_depth=7)
model.fit(X_train_balanced,y_train_balanced)
plt.subplots(nrows = 1,ncols = 1,figsize = (14,14), dpi=200)
tree.plot_tree(model, fontsize=5, filled=True)
plt.title('Decision Tree with Max Depth Value of 7')
plt.show()
Manually iterating over a range of values for all the hyperparameters takes a lot of time and is not the most systematic or efficient way to tune a model. Instead, we decided to use a cross-validation approach with GridSearchCV, which helps find the best combination of hyperparameters in a more automated way. In addition, performance metrics obtained through cross-validation are generally more reliable than those from a single train/test split.
model = DecisionTreeClassifier(random_state = 0)
param_grid = {
'max_depth': [None,8,14,16],
'min_samples_split': [2,5,10],
'ccp_alpha':[0.0001,0.001,0.01]
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train_balanced, y_train_balanced)
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 8, 14, 16],
                         'min_samples_split': [2, 5, 10]})
best_dt_model = grid_search.best_estimator_
best_dt_model
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, random_state=0)
plt.subplots(nrows = 1,ncols = 1,figsize = (14,14), dpi=200)
tree.plot_tree(best_dt_model, fontsize=5, filled=True)
plt.title('Decision Tree with the best parameters defined from a Cross Validation Method')
plt.show()
scores(best_dt_model)
Classification Report for test data:
precision recall f1-score support
0.0 0.97 0.93 0.95 855
1.0 0.65 0.82 0.73 145
accuracy 0.91 1000
macro avg 0.81 0.87 0.84 1000
weighted avg 0.92 0.91 0.91 1000
preds = best_dt_model.predict(X_test)
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of Decision Tree Model')
plt.show()
It's important to notice that GridSearchCV selected max_depth = 16, instead of the value of 7 from our previous manual search. This may happen because hyperparameters can interact with each other: the optimal value for one hyperparameter might depend on the values of the others, and here the combination with ccp_alpha pruning led to a higher max_depth being selected.
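One way to inspect such interactions is to tabulate the grid-search results. The sketch below (a stand-alone example on synthetic data, mirroring the param_grid used above) pivots the mean cross-validation score by max_depth and ccp_alpha; in the notebook the same two lines at the end could be applied to the fitted `grid_search` object directly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in for the balanced training data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {'max_depth': [None, 8, 14, 16], 'ccp_alpha': [0.0001, 0.001, 0.01]}
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
grid_search.fit(X, y)

# One row per parameter combination; pivot mean CV score by the two hyperparameters
results = pd.DataFrame(grid_search.cv_results_)
results['param_max_depth'] = results['param_max_depth'].astype(str)  # keep the None row
pivot = results.pivot_table(index='param_max_depth',
                            columns='param_ccp_alpha',
                            values='mean_test_score')
print(pivot.round(3))
```

Reading across a row shows how the best ccp_alpha shifts as max_depth changes, which is exactly the kind of interaction that can make the jointly optimal depth differ from the one found by tuning depth alone.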
Now, with a different approach, we decided to tune the hyperparameters in order to achieve the best possible F1 score. However, both models showed similar values of accuracy and F1. The reason may be that the specific hyperparameter values that work well for one metric also work well for another, so we end up with essentially the same model.
model_f1 = DecisionTreeClassifier(random_state = 0)
grid_search_f1 = GridSearchCV(model_f1, param_grid, scoring='f1', cv=5)
grid_search_f1.fit(X_train_balanced, y_train_balanced)
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 8, 14, 16],
                         'min_samples_split': [2, 5, 10]},
             scoring='f1')
best_dt_model_f1_b = grid_search_f1.best_estimator_
best_dt_model_f1_b
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, random_state=0)
scores(best_dt_model_f1_b)
Classification Report for test data:
precision recall f1-score support
0.0 0.97 0.93 0.95 855
1.0 0.65 0.82 0.73 145
accuracy 0.91 1000
macro avg 0.81 0.87 0.84 1000
weighted avg 0.92 0.91 0.91 1000
preds = best_dt_model_f1_b.predict(X_test)
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of the F1 Tuned Decision Tree Model')
plt.show()
a.4) Tree Ensembles Methods¶
We decided to apply Bagging, Random Forest and Boosting models. Our approach was defined by the following steps:
- Tune the number of estimators (trees) to use (generally, the more trees, the more stable our predictions become, but at a higher computational cost)
- Tune any other hyperparameters
- Apply and fit the final model
- Evaluate the models and compare them through a ROC curve
In the first step, we fine-tuned the number of estimators with a smaller set of data to speed up the hyperparameter tuning process while still obtaining a reasonable estimate of the model's performance.
X_train_tune, X_train_final, y_train_tune, y_train_final = train_test_split(X_train_balanced, y_train_balanced, test_size=0.8, random_state=0)
bag = BaggingClassifier(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250]
}
grid_search = GridSearchCV(bag, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_bag_b = grid_search.best_estimator_
best_bag_b
BaggingClassifier(n_estimators=200, random_state=0)
base_tree = DecisionTreeClassifier()
best_bag_b.fit(X_train_balanced, y_train_balanced)
BaggingClassifier(n_estimators=200, random_state=0)
For the Random Forest model, in addition to the number of trees, we have to define the maximum number of features each individual tree in the forest is allowed to consider when splitting a node. We can set it to None, 'sqrt' or 'log2'. If the dataset has a very large number of features, this hyperparameter helps control overfitting.
rf = RandomForestClassifier(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250],
'max_features': [None, 'sqrt', 'log2']
}
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_rf_b = grid_search.best_estimator_
best_rf_b
RandomForestClassifier(n_estimators=200, random_state=0)
best_rf_b.fit(X_train_balanced, y_train_balanced)
scores(best_rf_b)
Classification Report for test data:
precision recall f1-score support
0.0 0.97 0.96 0.96 855
1.0 0.76 0.83 0.79 145
accuracy 0.94 1000
macro avg 0.87 0.89 0.88 1000
weighted avg 0.94 0.94 0.94 1000
model = AdaBoostClassifier(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250]
}
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_ada_b = grid_search.best_estimator_
best_ada_b
AdaBoostClassifier(n_estimators=100, random_state=0)
best_ada_b.fit(X_train_balanced, y_train_balanced)
scores(best_ada_b)
Classification Report for test data:
precision recall f1-score support
0.0 0.94 0.89 0.91 855
1.0 0.49 0.65 0.56 145
accuracy 0.85 1000
macro avg 0.71 0.77 0.73 1000
weighted avg 0.87 0.85 0.86 1000
model = GBC(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250]
}
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_gbc_b = grid_search.best_estimator_
best_gbc_b.fit(X_train_balanced, y_train_balanced)
scores(best_gbc_b)
Classification Report for test data:
precision recall f1-score support
0.0 0.97 0.96 0.96 855
1.0 0.77 0.82 0.80 145
accuracy 0.94 1000
macro avg 0.87 0.89 0.88 1000
weighted avg 0.94 0.94 0.94 1000
models = [best_bag_b,best_rf_b, best_ada_b, best_gbc_b]
model_names = ['Bagging','Random Forest','Ada Booster','Gradient Boosting']
plt.figure(figsize=(8, 8))
for model, name in zip(models, model_names):
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=2, label=f'{name} (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
a.5) Support vector machines¶
Support Vector Machines (SVMs) are supervised learning models used for classification and regression tasks. SVMs can handle both linearly and non-linearly separable data by using kernel functions (e.g., linear, polynomial, radial basis function - RBF).
In this work we used the SVC class from scikit-learn to create an SVM model, tested the linear and RBF kernel functions, and used a confusion matrix to analyze the model's ability to correctly predict churn and non-churn instances. As previously done for the Bayesian classifier, we also scaled the numerical features using StandardScaler (for approximately normally distributed features) and MinMaxScaler (for features that don't follow a normal distribution).
X_train_balanced = X_train_balanced.loc[:, X_train_balanced.columns != 'churn']
y_train_balanced = y_train_balanced.astype(int)
scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
'totalnightcharge', 'totalintlminutes', 'totalintlcharge']
scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']
X_numeric_train_balanced = X_train_balanced[numeric_features]
X_numeric_train_balanced_scaled = scaler_standard.fit_transform(X_numeric_train_balanced)
X_train_balanced_scaled = X_train_balanced.copy()
X_train_balanced_scaled[numeric_features] = X_numeric_train_balanced_scaled
X_numeric_minmax = X_train_balanced_scaled[numeric_features_minmax]
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)
X_train_balanced_scaled[numeric_features_minmax] = X_numeric_minmax_scaled
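The per-column scaling above can also be expressed with scikit-learn's ColumnTransformer, which bundles both scalers into a single object so the same transformations are applied consistently at prediction time. A minimal, self-contained sketch with a small stand-in DataFrame (the real feature lists are the ones defined above):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Stand-ins for the two feature groups used above
standard_cols = ['totaldayminutes', 'totaldaycharge']
minmax_cols = ['numbercustomerservicecalls']

rng = np.random.default_rng(0)
X = pd.DataFrame({
    'totaldayminutes': rng.normal(180, 50, 100),
    'totaldaycharge': rng.normal(30, 9, 100),
    'numbercustomerservicecalls': rng.integers(0, 9, 100),
})

# StandardScaler for the roughly normal features, MinMaxScaler for the skewed ones;
# remainder='passthrough' would keep any untouched columns unchanged
preprocess = ColumnTransformer(
    [('standard', StandardScaler(), standard_cols),
     ('minmax', MinMaxScaler(), minmax_cols)],
    remainder='passthrough')

X_scaled = preprocess.fit_transform(X)
print(X_scaled.shape)  # one column per input feature
```

Combined with `make_pipeline` (already imported at the top of the notebook), this would also prevent accidentally fitting the scalers on test data.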
# Linear Kernel
svm_model_lb = SVC(kernel='linear', C=100)
svm_model_lb.fit(X_train_balanced_scaled, y_train_balanced)
train_preds = svm_model_lb.predict(X_train_balanced_scaled)
X_train_balanced_accuracy = accuracy_score(y_train_balanced, train_preds)
print(f"Accuracy on X_train_balanced data: {X_train_balanced_accuracy:.2f}")
print("Classification Report for X_train_balanced data:")
print(classification_report(y_train_balanced, train_preds, zero_division=0))
correct_predictions = (train_preds == y_train_balanced)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
train_conf_matrix = confusion_matrix(y_train_balanced, train_preds)
plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SVM Linear Kernel')
plt.show()
Accuracy on X_train_balanced data: 0.80
Classification Report for X_train_balanced data:
precision recall f1-score support
0 0.82 0.76 0.79 1719
1 0.78 0.84 0.80 1719
accuracy 0.80 3438
macro avg 0.80 0.80 0.80 3438
weighted avg 0.80 0.80 0.80 3438
Number of correct predictions: 2741
The model attained a reasonable accuracy of 80% on the balanced training data. It displays 82% precision in identifying the not-churned class, but with a 76% recall, indicating it misses some actual instances of this class. For the churned class, precision stands at 78%, with a better recall of 84%. The f1-scores, which balance precision and recall, are 79% and 80% for the not-churned and churned classes, respectively. This suggests a balanced model performance, with nearly equal predictive accuracy across both classes.
# RBF Kernel
svm_model_rbf_b = SVC(kernel='rbf', C=100)
svm_model_rbf_b.fit(X_train_balanced_scaled, y_train_balanced)
train_preds = svm_model_rbf_b.predict(X_train_balanced_scaled)
X_train_balanced_accuracy = accuracy_score(y_train_balanced, train_preds)
print(f"Accuracy on X_train_balanced data: {X_train_balanced_accuracy:.2f}")
print("Classification Report for X_train_balanced data:")
print(classification_report(y_train_balanced, train_preds, zero_division=0))
correct_predictions = (train_preds == y_train_balanced)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
train_conf_matrix = confusion_matrix(y_train_balanced, train_preds)
plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SVM RBF Kernel')
plt.show()
Accuracy on X_train_balanced data: 1.00
Classification Report for X_train_balanced data:
precision recall f1-score support
0 1.00 1.00 1.00 1719
1 1.00 1.00 1.00 1719
accuracy 1.00 3438
macro avg 1.00 1.00 1.00 3438
weighted avg 1.00 1.00 1.00 3438
Number of correct predictions: 3426
The model displays a near-perfect accuracy on the balanced training data (3426 of 3438 predictions correct, which rounds to 1.00), with precision, recall, and f1-scores of 100% for both classes. Such a close fit to the training data is a strong sign of overfitting, so we validate both kernels with cross-validation.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
svm_model_rbf_b = SVC(kernel='rbf', C=100)
scores_rbf_b = cross_val_score(svm_model_rbf_b, X_train_balanced_scaled, y_train_balanced, cv=cv, scoring='accuracy')
print("Cross-validation scores for RBF Kernel:", scores_rbf_b)
print("Mean accuracy for RBF Kernel:", np.mean(scores_rbf_b))
svm_model_lb = SVC(kernel='linear', C=100)
scores_lb = cross_val_score(svm_model_lb, X_train_balanced_scaled, y_train_balanced, cv=cv, scoring='accuracy')
print("Cross-validation scores for Linear Kernel:", scores_lb)
print("Mean accuracy for Linear Kernel:", np.mean(scores_lb))
Cross-validation scores for RBF Kernel: [0.90988372 0.8997093  0.90261628 0.89665211 0.88500728]
Mean accuracy for RBF Kernel: 0.898773738194374
Cross-validation scores for Linear Kernel: [0.80377907 0.79215116 0.77180233 0.80494905 0.79039301]
Mean accuracy for Linear Kernel: 0.7926149250194644
These results suggest that the RBF kernel performs significantly better on average than the linear kernel, as indicated by its higher mean cross-validation accuracy (0.90 vs. 0.79).
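The RBF result could likely be improved further, and its overfitting reduced, by tuning C and gamma jointly rather than fixing C=100. A sketch of that search on synthetic stand-in data (in the notebook, X and y would be the scaled balanced training set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the scaled, balanced training data
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# C controls the margin/misclassification trade-off; gamma controls the RBF kernel width.
# Large C with large gamma tends to memorize the training data.
param_grid = {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Because the selection is scored by cross-validation rather than training accuracy, a combination that merely memorizes the training set would not be chosen.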
a.6) Neural Network Classifier¶
To create a Neural Network Classifier, we started by creating a model with no defined parameters. We fit this model to our training data and used it to make predictions of our target variable (y_pred) using our test variables (X_test).
clf = MLPClassifier().fit(X_train_balanced, y_train_balanced)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.679
Precision: 0.28743961352657005
Recall: 0.8206896551724138
F1 Score: 0.42576028622540246
This model has an accuracy of 0.679; precision is low (0.29) even though recall is high (0.82), which in turn gives us a low F1 score.
We created a hyperparameter grid to search for the best hidden layer sizes. We tested a few options of hidden layer sizes to demonstrate how the grid search works; however, more options can be tested. The more options we test, the higher the likelihood of getting a higher-performing model, but the search will take longer to run. The performance of the models will be evaluated using accuracy.
mlp_gs = MLPClassifier(max_iter=100)
parameter_space = {
'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20), (10, 50, 10)],
'activation': ['tanh', 'relu'],
'solver': ['sgd', 'adam'],
'alpha': [0.0001, 0.05],
'learning_rate': ['constant','adaptive'],
}
clf = GridSearchCV(mlp_gs, parameter_space, n_jobs=-1, cv=5)
clf.fit(X_train_balanced, y_train_balanced)
C:\Users\Manuela\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (100) reached and the optimization hasn't converged yet. warnings.warn(
GridSearchCV(cv=5, estimator=MLPClassifier(max_iter=100), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20),
                                                (10, 50, 10)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'adam']})
print('Best parameters found:\n', clf.best_params_)
Best parameters found:
{'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
0.639 (+/-0.031) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.771 (+/-0.022) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.652 (+/-0.040) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.763 (+/-0.024) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.668 (+/-0.029) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.784 (+/-0.054) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.686 (+/-0.062) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.785 (+/-0.045) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.689 (+/-0.040) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.807 (+/-0.056) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.708 (+/-0.033) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.812 (+/-0.058) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.603 (+/-0.069) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.619 (+/-0.030) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.609 (+/-0.116) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.617 (+/-0.125) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.604 (+/-0.052) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.621 (+/-0.043) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.609 (+/-0.028) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.646 (+/-0.083) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.646 (+/-0.053) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.764 (+/-0.051) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.641 (+/-0.043) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.749 (+/-0.052) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.667 (+/-0.032) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.794 (+/-0.038) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.667 (+/-0.043) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.807 (+/-0.040) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.698 (+/-0.037) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.816 (+/-0.049) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.697 (+/-0.048) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.817 (+/-0.050) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.538 (+/-0.066) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.678 (+/-0.123) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.593 (+/-0.096) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.660 (+/-0.102) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.619 (+/-0.067) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.651 (+/-0.035) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.614 (+/-0.088) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.600 (+/-0.082) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.669 (+/-0.028) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.784 (+/-0.057) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.654 (+/-0.040) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.771 (+/-0.071) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.672 (+/-0.012) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.775 (+/-0.074) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.676 (+/-0.032) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.772 (+/-0.105) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.691 (+/-0.042) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.805 (+/-0.053) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.692 (+/-0.042) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.788 (+/-0.068) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.628 (+/-0.127) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.645 (+/-0.079) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.572 (+/-0.138) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.715 (+/-0.076) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.627 (+/-0.131) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.691 (+/-0.063) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.621 (+/-0.119) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.629 (+/-0.148) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.650 (+/-0.060) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.769 (+/-0.034) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.665 (+/-0.030) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.766 (+/-0.028) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.671 (+/-0.052) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.780 (+/-0.064) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.675 (+/-0.053) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.776 (+/-0.056) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.691 (+/-0.041) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.808 (+/-0.057) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.696 (+/-0.051) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.783 (+/-0.048) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.631 (+/-0.082) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.700 (+/-0.097) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.543 (+/-0.122) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.685 (+/-0.099) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.600 (+/-0.154) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.721 (+/-0.060) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.656 (+/-0.041) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.702 (+/-0.125) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
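The long listing above is easier to scan as a sorted table. This self-contained sketch (a fast KNN grid stands in for the MLP search, since any fitted GridSearchCV exposes the same cv_results_) shows the idea; in the notebook the last four lines could be applied to the fitted `clf` directly:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data and a quick grid search
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X, y)

# Rank all tried configurations by mean cross-validation score
results = pd.DataFrame(grid.cv_results_)
top = (results[['params', 'mean_test_score', 'std_test_score']]
       .sort_values('mean_test_score', ascending=False)
       .head(5))
print(top.to_string(index=False))
```

Keeping the standard deviation column alongside the mean also makes it easy to see when two configurations are statistically indistinguishable.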
clf_b = MLPClassifier(activation= 'tanh', alpha = 0.05, hidden_layer_sizes = (200,), learning_rate = 'adaptive', solver = 'adam').fit(X_train_balanced, y_train_balanced)
y_pred = clf_b.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.795
Precision: 0.38549618320610685
Recall: 0.696551724137931
F1 Score: 0.4963144963144963
As we can see, creating a hyperparameter grid allowed us to select the best hidden layer sizes among the ones we tested. The new model has an accuracy of 0.795, a clear improvement over our first model's 0.679, which had no defined parameters. Precision and the F1 score also improved, at the cost of a somewhat lower recall.
print("Classification Report for test data:")
print(classification_report(y_test, y_pred, zero_division=0))
correct_predictions = (y_pred == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
Classification Report for test data:
precision recall f1-score support
0.0 0.94 0.81 0.87 855
1.0 0.39 0.70 0.50 145
accuracy 0.80 1000
macro avg 0.66 0.75 0.68 1000
weighted avg 0.86 0.80 0.82 1000
Number of correct predictions: 795
NNC_conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 4))
sns.heatmap(NNC_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NNC')
plt.show()
b) Modelling with the unbalanced dataset¶
b.1) Nearest neighbor¶
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
X_test_scaler = scaler.transform(X_test)
To start training the model we used a fixed value of k=2, which will be later optimized.
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train_scaler, y_train)
KNeighborsClassifier(n_neighbors=2)
y_pred = knn.predict(X_test_scaler)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.88
We may be able to get a better accuracy value by optimizing our value of k. In order to do this, we loop through many different k values.
k_values = list(range(1, 30))
scores = []
# The data were already standardized above (X_train_scaler, X_test_scaler),
# so no additional scaling is needed here.
for k in k_values:
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train_scaler, y_train)
y_pred = knn.predict(X_test_scaler)
score = accuracy_score(y_test, y_pred)
scores.append(np.mean(score))
sns.lineplot(x = k_values, y = scores, marker = 'o')
plt.xlabel("K Values")
plt.ylabel("Accuracy Score")
Text(0, 0.5, 'Accuracy Score')
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaler, y_train)
y_pred = knn.predict(X_test_scaler)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.898
We can see from our chart that k = 3 and k = 7 appear to have the highest accuracy scores. After calculating their individual accuracies, k = 3 has a slightly higher value, so this is the value we will use.
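Selecting k by maximizing accuracy on the test set risks tuning to that particular split. A minimal sketch of the alternative, choosing k by cross-validation on the training data only (synthetic data stands in for our scaled train split, and the pipeline refits the scaler inside each fold):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the training data (imbalanced, like the churn set)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

# Mean 5-fold CV accuracy for each candidate k
cv_means = {}
for k in range(1, 30):
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(pipe, X, y, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print("best k:", best_k)
```

The test set is then touched only once, to report the final score of the chosen k.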
These are the accuracy, precision, and recall scores for our model, using the best k score of 3:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train_scaler, y_train)
y_pred = knn.predict(X_test_scaler)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.898 Precision: 0.7721518987341772 Recall: 0.4206896551724138 F1 Score: 0.5446428571428571
As we can see, the accuracy with k=3 is higher than the value we obtained initially with k=2. Although the accuracy of this model is considerably high, precision and recall are not, with recall under 50%, which is reflected in a low f1 score.
KNN_conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 4))
sns.heatmap(KNN_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of KNN')
plt.show()
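Since roc_auc_score is imported above, the KNN model can also be assessed threshold-independently via its predicted probabilities, which is informative under class imbalance. A sketch with synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the scaled train/test splits
X, y = make_classification(n_samples=1000, weights=[0.86, 0.14], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
# AUC uses the probability of the positive (churned) class
auc_val = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])
print(round(auc_val, 3))
```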
b.2) Bayesian classifier¶
X_train = train_data.loc[:, train_data.columns != 'churn']
y_train = train_data['churn'].astype(int)
X_test = test_data.loc[:, test_data.columns != 'churn']
y_test = test_data['churn'].astype(int)
scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
'totalnightcharge', 'totalintlminutes', 'totalintlcharge']
scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']
X_numeric_train = X_train[numeric_features]
X_numeric_train_scaled = scaler_standard.fit_transform(X_numeric_train)
X_train_scaled = X_train.copy()
X_train_scaled[numeric_features] = X_numeric_train_scaled
X_numeric_minmax = X_train_scaled[numeric_features_minmax]
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)
X_train_scaled[numeric_features_minmax] = X_numeric_minmax_scaled
X_numeric_test = X_test[numeric_features]
X_numeric_test_scaled = scaler_standard.transform(X_numeric_test)
X_test_scaled = X_test.copy()
X_test_scaled[numeric_features] = X_numeric_test_scaled
X_numeric_test_minmax = X_test_scaled[numeric_features_minmax]
X_numeric_test_minmax_scaled = scaler_minmax.transform(X_numeric_test_minmax)
X_test_scaled[numeric_features_minmax] = X_numeric_test_minmax_scaled
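The per-column scaling above (StandardScaler for the usage features, MinMaxScaler for the count-like ones) can be expressed more compactly with a ColumnTransformer, which also guarantees the fitted transforms are applied consistently to the test set. A sketch with a small illustrative frame; the column names echo the ones above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'totaldayminutes': rng.normal(180, 50, 100),   # standard-scaled
    'totalintlcalls': rng.integers(0, 20, 100),    # min-max scaled
    'internationalplan': rng.integers(0, 2, 100),  # binary, passed through
})

pre = ColumnTransformer(
    [('std', StandardScaler(), ['totaldayminutes']),
     ('minmax', MinMaxScaler(), ['totalintlcalls'])],
    remainder='passthrough')

Xt = pre.fit_transform(df)   # fit on train; use pre.transform(...) on test
print(Xt.shape)
```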
model_NB = MixedNB()
model_NB.fit(X_train_scaled, y_train)
preds = model_NB.predict(X_train_scaled)
accuracy = accuracy_score(y_train, preds)
print(f"Accuracy on train data: {accuracy:.2f}")
print("Classification Report for train data:")
print(classification_report(y_train, preds, zero_division=0))
correct_predictions = (preds == y_train)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
conf_matrix = confusion_matrix(y_train, preds)
plt.figure(figsize=(4, 4))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
preds = model_NB.predict(X_test_scaled)
accuracy = accuracy_score(y_test, preds)
print(f"Accuracy on test data: {accuracy:.2f}")
print("Classification Report for test data:")
print(classification_report(y_test, preds, zero_division=0))
correct_predictions = (preds == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
conf_matrix = confusion_matrix(y_test, preds)
plt.figure(figsize=(4, 4))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NB')
plt.show()
Accuracy on train data: 0.88
Classification Report for train data:
precision recall f1-score support
0 0.92 0.94 0.93 3438
1 0.58 0.53 0.55 562
accuracy 0.88 4000
macro avg 0.75 0.73 0.74 4000
weighted avg 0.88 0.88 0.88 4000
Number of correct predictions: 3520
Accuracy on test data: 0.87
Classification Report for test data:
precision recall f1-score support
0 0.92 0.93 0.92 855
1 0.56 0.52 0.54 145
accuracy 0.87 1000
macro avg 0.74 0.73 0.73 1000
weighted avg 0.87 0.87 0.87 1000
Number of correct predictions: 871
The results reflect the class imbalance in the dataset: the churned class has far fewer samples than the not-churned class, which likely hurts the classifier's ability to predict the minority class accurately.
As a result, the model performs better on the not-churned class than on the churned class in both precision and recall, indicating difficulty in correctly identifying churn cases.
Nonetheless, the model generalizes reasonably well from training to test data, with only a small drop in performance and similar accuracy, precision, recall, and f1-scores on the test set.
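One common way to counter this imbalance is to oversample the minority class on the training data only, leaving the test set untouched. A minimal sketch using plain random oversampling and GaussianNB as a stand-in classifier (synthetic data replaces the churn features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, weights=[0.86, 0.14], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Baseline trained on the imbalanced data
rec_base = recall_score(y_te, GaussianNB().fit(X_tr, y_tr).predict(X_te))

# Randomly duplicate minority rows until both classes match in size
rng = np.random.default_rng(0)
minority = np.where(y_tr == 1)[0]
extra = rng.choice(minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

rec_bal = recall_score(y_te, GaussianNB().fit(X_bal, y_bal).predict(X_te))
print(round(rec_base, 3), round(rec_bal, 3))
```

SMOTE (imported at the top of the notebook) follows the same idea but synthesizes new minority points instead of duplicating them.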
b.3) Decision Tree¶
model = DecisionTreeClassifier(random_state = 0)
param_grid = {
    'max_depth': [None, 4, 8, 14, 16, 20, 22],
    'min_samples_split': [15, 20, 25, 30],
    'ccp_alpha': [0.0001, 0.001, 0.01]
}
grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 4, 8, 14, 16, 20, 22],
                         'min_samples_split': [15, 20, 25, 30]})
best_dt_model = grid_search.best_estimator_
best_dt_model
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, min_samples_split=25,
                       random_state=0)
plt.subplots(nrows=1, ncols=1, figsize=(14, 14), dpi=200)
tree.plot_tree(best_dt_model, fontsize=5, filled=True)
plt.title('Decision Tree with the best parameters defined from a Cross Validation Method')
plt.show()
def scores(model):
    preds = model.predict(X_test)
    print("Classification Report for test data:")
    print(classification_report(y_test, preds, zero_division=0))
scores(best_dt_model)
Classification Report for test data:
precision recall f1-score support
0.0 0.95 0.99 0.97 855
1.0 0.90 0.67 0.77 145
accuracy 0.94 1000
macro avg 0.92 0.83 0.87 1000
weighted avg 0.94 0.94 0.94 1000
preds = best_dt_model.predict(X_test)
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of Decision Tree Model')
plt.show()
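Beyond its classification report, a fitted decision tree exposes feature_importances_, which can indicate which variables drive the churn predictions. A sketch with synthetic data and illustrative feature names:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the churn training data
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
names = [f"feat_{i}" for i in range(5)]

dt = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

# Importances sum to 1; rank features from most to least informative
ranked = sorted(zip(names, dt.feature_importances_), key=lambda t: -t[1])
for name, imp in ranked:
    print(f"{name}: {imp:.3f}")
```

In the notebook, passing `X_train.columns` instead of made-up names would label the real features.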
model_f1 = DecisionTreeClassifier(random_state = 0)
grid_search_f1 = GridSearchCV(model_f1, param_grid, scoring='f1', cv=5)
grid_search_f1.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'ccp_alpha': [0.0001, 0.001, 0.01],
                         'max_depth': [None, 4, 8, 14, 16, 20, 22],
                         'min_samples_split': [15, 20, 25, 30]},
             scoring='f1')
best_dt_model_f1 = grid_search_f1.best_estimator_
best_dt_model_f1
DecisionTreeClassifier(ccp_alpha=0.001, max_depth=16, min_samples_split=25,
                       random_state=0)
scores(best_dt_model_f1)
Classification Report for test data:
precision recall f1-score support
0.0 0.95 0.99 0.97 855
1.0 0.90 0.67 0.77 145
accuracy 0.94 1000
macro avg 0.92 0.83 0.87 1000
weighted avg 0.94 0.94 0.94 1000
preds = best_dt_model_f1.predict(X_test)
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(4, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of the F1-Tuned Decision Tree Model')
plt.show()
b.4) Tree Ensembles Methods¶
# hold out 20% of the training data for hyperparameter tuning
X_train_tune, X_train_final, y_train_tune, y_train_final = train_test_split(X_train, y_train, test_size=0.8, random_state=0)
bag = BaggingClassifier(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250]
}
grid_search = GridSearchCV(bag, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_bag = grid_search.best_estimator_
best_bag
BaggingClassifier(n_estimators=200, random_state=0)
bag = BaggingClassifier(n_estimators=200, random_state=0)
bag.fit(X_train, y_train)
scores(bag)
Classification Report for test data:
precision recall f1-score support
0.0 0.96 0.99 0.97 855
1.0 0.90 0.73 0.81 145
accuracy 0.95 1000
macro avg 0.93 0.86 0.89 1000
weighted avg 0.95 0.95 0.95 1000
rf = RandomForestClassifier(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250],
'max_features': [None, 'sqrt', 'log2']
}
grid_search = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_rf = grid_search.best_estimator_
best_rf
RandomForestClassifier(max_features=None, n_estimators=250, random_state=0)
# note: the grid search above selected n_estimators=250 with max_features=None;
# here we train with 200 estimators and the default max_features
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train,y_train)
scores(rf)
Classification Report for test data:
precision recall f1-score support
0.0 0.96 0.99 0.97 855
1.0 0.91 0.75 0.82 145
accuracy 0.95 1000
macro avg 0.93 0.87 0.90 1000
weighted avg 0.95 0.95 0.95 1000
model = AdaBoostClassifier(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250]
}
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_ada = grid_search.best_estimator_
best_ada
AdaBoostClassifier(n_estimators=100, random_state=0)
# note: the grid search above selected n_estimators=100; here we train with 150
adaboosted_tree = AdaBoostClassifier(n_estimators=150, random_state=0)
adaboosted_tree.fit(X_train, y_train)
scores(adaboosted_tree)
Classification Report for test data:
precision recall f1-score support
0.0 0.90 0.96 0.93 855
1.0 0.61 0.37 0.46 145
accuracy 0.87 1000
macro avg 0.75 0.66 0.69 1000
weighted avg 0.86 0.87 0.86 1000
model = GBC(random_state=0)
param_grid = {
'n_estimators': [100, 150, 200, 250]
}
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X_train_tune, y_train_tune)
best_gbc = grid_search.best_estimator_
best_params = grid_search.best_params_
best_params
{'n_estimators': 150}
# note: the grid search above selected n_estimators=150; here we train with 100
gbc_boosted_tree = GBC(n_estimators=100, random_state=0)
gbc_boosted_tree.fit(X_train, y_train)
scores(gbc_boosted_tree)
Classification Report for test data:
precision recall f1-score support
0.0 0.96 0.99 0.97 855
1.0 0.91 0.75 0.82 145
accuracy 0.95 1000
macro avg 0.93 0.87 0.90 1000
weighted avg 0.95 0.95 0.95 1000
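To compare the four ensemble methods on equal footing, cross-validated F1 on the training data is more robust than a single test split. A sketch with small estimator counts for speed and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

# Stand-in for the churn training set
X, y = make_classification(n_samples=800, weights=[0.86, 0.14], random_state=0)

models = {
    'bagging': BaggingClassifier(n_estimators=50, random_state=0),
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'adaboost': AdaBoostClassifier(n_estimators=50, random_state=0),
    'gbc': GradientBoostingClassifier(n_estimators=50, random_state=0),
}
results = {name: cross_val_score(m, X, y, cv=5, scoring='f1').mean()
           for name, m in models.items()}
for name, f1 in sorted(results.items(), key=lambda t: -t[1]):
    print(f"{name}: {f1:.3f}")
```

F1 on the churned class is the relevant metric here, since accuracy alone is inflated by the majority class.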
b.6) Support vector machines¶
X_train = train_data.loc[:, train_data.columns != 'churn']
y_train = train_data['churn'].astype(int)
X_test = test_data.loc[:, test_data.columns != 'churn']
y_test = test_data['churn'].astype(int)
scaler_standard = StandardScaler()
numeric_features = ['accountlength', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
'totalnightcharge', 'totalintlminutes', 'totalintlcharge']
scaler_minmax = MinMaxScaler()
numeric_features_minmax = ['numbervmailmessages', 'totalintlcalls', 'numbercustomerservicecalls']
X_numeric_train = X_train[numeric_features]
X_numeric_train_scaled = scaler_standard.fit_transform(X_numeric_train)
X_train_scaled = X_train.copy()
X_train_scaled[numeric_features] = X_numeric_train_scaled
X_numeric_minmax = X_train_scaled[numeric_features_minmax]
X_numeric_minmax_scaled = scaler_minmax.fit_transform(X_numeric_minmax)
X_train_scaled[numeric_features_minmax] = X_numeric_minmax_scaled
X_numeric_test = X_test[numeric_features]
X_numeric_test_scaled = scaler_standard.transform(X_numeric_test)
X_test_scaled = X_test.copy()
X_test_scaled[numeric_features] = X_numeric_test_scaled
X_numeric_test_minmax = X_test_scaled[numeric_features_minmax]
X_numeric_test_minmax_scaled = scaler_minmax.transform(X_numeric_test_minmax)
X_test_scaled[numeric_features_minmax] = X_numeric_test_minmax_scaled
# Linear Kernel
svm_model_l = SVC(kernel='linear', C=100)
svm_model_l.fit(X_train_scaled, y_train)
train_preds = svm_model_l.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, train_preds)
print(f"Accuracy on train data: {train_accuracy:.2f}")
print("Classification Report for train data:")
print(classification_report(y_train, train_preds, zero_division=0))
correct_predictions = (train_preds == y_train)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
train_conf_matrix = confusion_matrix(y_train, train_preds)
plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
test_preds = svm_model_l.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, test_preds)
print(f"Accuracy on test data: {test_accuracy:.2f}")
print("Classification Report for test data:")
print(classification_report(y_test, test_preds, zero_division=0))
correct_predictions_test = (test_preds == y_test)
num_correct_test = sum(correct_predictions_test)
print(f"Number of correct predictions on test data: {num_correct_test}")
test_conf_matrix = confusion_matrix(y_test, test_preds)
plt.figure(figsize=(4, 4))
sns.heatmap(test_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SVM Linear Kernel')
plt.show()
Accuracy on train data: 0.86
Classification Report for train data:
precision recall f1-score support
0 0.86 1.00 0.92 3438
1 0.00 0.00 0.00 562
accuracy 0.86 4000
macro avg 0.43 0.50 0.46 4000
weighted avg 0.74 0.86 0.79 4000
Number of correct predictions: 3438
Accuracy on test data: 0.85
Classification Report for test data:
precision recall f1-score support
0 0.85 1.00 0.92 855
1 0.00 0.00 0.00 145
accuracy 0.85 1000
macro avg 0.43 0.50 0.46 1000
weighted avg 0.73 0.85 0.79 1000
Number of correct predictions on test data: 855
The linear kernel SVM model achieved similar accuracy on train (86%) and test (85%) data. However, precision, recall, and f1-scores were 0% for the churned class on both sets, meaning the model never correctly predicted a churned customer.
This again reflects the class imbalance, which leads the model to favor the majority class. It may also indicate that a linear kernel cannot capture the complex relationships associated with churn, so alternative algorithms or kernels should be used to improve the model's ability to predict churn accurately.
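One lightweight remedy, short of resampling, is SVC's class_weight='balanced' option, which scales the penalty C inversely to class frequency so minority errors cost more. A sketch with synthetic imbalanced data standing in for the scaled churn features:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.86, 0.14],
                           class_sep=0.8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Plain linear SVM vs. one with class-frequency-balanced penalties
plain = SVC(kernel='linear', C=100).fit(X_tr, y_tr)
weighted = SVC(kernel='linear', C=100, class_weight='balanced').fit(X_tr, y_tr)

rec_plain = recall_score(y_te, plain.predict(X_te))
rec_weighted = recall_score(y_te, weighted.predict(X_te))
print(round(rec_plain, 3), round(rec_weighted, 3))
```

Weighting typically raises minority recall at some cost to precision; whether that trade-off pays off depends on the business cost of missed churners.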
svm_model_rbf = SVC(kernel='rbf', C=100)
svm_model_rbf.fit(X_train_scaled, y_train)
train_preds = svm_model_rbf.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, train_preds)
print(f"Accuracy on train data: {train_accuracy:.2f}")
print("Classification Report for train data:")
print(classification_report(y_train, train_preds, zero_division=0))
correct_predictions = (train_preds == y_train)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
train_conf_matrix = confusion_matrix(y_train, train_preds)
plt.figure(figsize=(4, 4))
sns.heatmap(train_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
test_preds = svm_model_rbf.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, test_preds)
print(f"Accuracy on test data: {test_accuracy:.2f}")
print("Classification Report for test data:")
print(classification_report(y_test, test_preds, zero_division=0))
correct_predictions_test = (test_preds == y_test)
num_correct_test = sum(correct_predictions_test)
print(f"Number of correct predictions on test data: {num_correct_test}")
test_conf_matrix = confusion_matrix(y_test, test_preds)
plt.figure(figsize=(4, 4))
sns.heatmap(test_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of SVM RBF Kernel')
plt.show()
Accuracy on train data: 0.99
Classification Report for train data:
precision recall f1-score support
0 0.99 1.00 1.00 3438
1 1.00 0.94 0.97 562
accuracy 0.99 4000
macro avg 0.99 0.97 0.98 4000
weighted avg 0.99 0.99 0.99 4000
Number of correct predictions: 3967
Accuracy on test data: 0.88
Classification Report for test data:
precision recall f1-score support
0 0.93 0.93 0.93 855
1 0.59 0.57 0.58 145
accuracy 0.88 1000
macro avg 0.76 0.75 0.75 1000
weighted avg 0.88 0.88 0.88 1000
Number of correct predictions on test data: 880
On the training data, the RBF kernel SVM achieved an accuracy of 99%, with high precision (99% and 100% for not churned and churned, respectively), recall (100% and 94%), and f1-scores (100% and 97%) for both classes, seemingly separating churned and not-churned instances almost perfectly.
Such a near-perfect fit on the training set raises concerns about overfitting, making it necessary to validate the model on a separate test dataset to assess its generalization ability.
Indeed, on the test set the accuracy dropped to 88%, and precision, recall, and f1-scores for the churned class fell sharply, showing that the model does not predict the minority class accurately on unseen data.
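The train/test gap suggests the fixed C=100 with the default gamma overfits; tuning both with cross-validation can trade some training fit for generalization. A sketch with an illustrative grid and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in for the scaled churn training data
X, y = make_classification(n_samples=400, weights=[0.86, 0.14], random_state=0)

# Smaller C regularizes more; smaller gamma smooths the RBF boundary
param_grid = {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.1]}
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3, scoring='f1')
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Scoring on f1 rather than accuracy keeps the search focused on the churned class.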
b.7) Neural Network Classifier¶
To create a Neural Network Classifier, we started with a model using default parameters. We fit this model to our training data and used it to predict the target variable (y_pred) from the test features (X_test).
clf = MLPClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.873 Precision: 0.6730769230769231 Recall: 0.2413793103448276 F1 Score: 0.3553299492385787
This model has an accuracy of 0.873; however, its precision and recall are low, which in turn gives us a low f1 score.
We created a hyperparameter grid to search for the best hidden layer sizes. We tested a few options to demonstrate how the grid search works; more options could be tested. The more options we test, the more likely we are to find a higher-performing model, but the search takes longer to run. The performance of the models is evaluated using accuracy.
mlp_gs = MLPClassifier(max_iter=100)
parameter_space = {
    'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20), (10, 50, 10)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}
clf = GridSearchCV(mlp_gs, parameter_space, n_jobs=-1, cv=5)
clf.fit(X_train, y_train)
GridSearchCV(cv=5, estimator=MLPClassifier(max_iter=100), n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],
                         'hidden_layer_sizes': [(50,), (100,), (200,), (10, 20),
                                                (10, 50, 10)],
                         'learning_rate': ['constant', 'adaptive'],
                         'solver': ['sgd', 'adam']})
print('Best parameters found:\n', clf.best_params_)
Best parameters found:
{'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
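After the search, GridSearchCV refits the best configuration on the full training data, so the best estimator can be evaluated directly on the held-out test set. A sketch with a deliberately tiny grid and synthetic stand-in data to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for the churn train/test splits
X, y = make_classification(n_samples=400, weights=[0.86, 0.14], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

grid = GridSearchCV(MLPClassifier(max_iter=300, random_state=0),
                    {'hidden_layer_sizes': [(50,), (10, 20)]}, cv=3)
grid.fit(X_tr, y_tr)

# best_estimator_ is already refit on X_tr; score it once on the test set
f1 = f1_score(y_te, grid.best_estimator_.predict(X_te))
print(grid.best_params_, round(f1, 3))
```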
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
0.860 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.862 (+/-0.009) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.861 (+/-0.009) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.866 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.862 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.865 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.868 (+/-0.011) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.869 (+/-0.011) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.868 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.871 (+/-0.012) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.860 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.861 (+/-0.010) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.860 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.002) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.861 (+/-0.012) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.861 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.866 (+/-0.010) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.861 (+/-0.010) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.867 (+/-0.007) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.866 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.868 (+/-0.009) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.863 (+/-0.011) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.868 (+/-0.019) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.866 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.864 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.008) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.869 (+/-0.012) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.860 (+/-0.005) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.003) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.862 (+/-0.006) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.861 (+/-0.008) for {'activation': 'tanh', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.865 (+/-0.010) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.856 (+/-0.017) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.863 (+/-0.008) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.864 (+/-0.014) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.864 (+/-0.010) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.863 (+/-0.007) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.866 (+/-0.007) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.868 (+/-0.008) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.868 (+/-0.015) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.857 (+/-0.042) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.014) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.867 (+/-0.012) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.860 (+/-0.001) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.865 (+/-0.010) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.001) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.867 (+/-0.005) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.862 (+/-0.005) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.871 (+/-0.008) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.860 (+/-0.002) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.862 (+/-0.015) for {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.867 (+/-0.010) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.854 (+/-0.018) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'constant', 'solver': 'adam'}
0.859 (+/-0.005) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.858 (+/-0.028) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (50,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.869 (+/-0.009) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.863 (+/-0.025) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}
0.867 (+/-0.008) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.855 (+/-0.032) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (100,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.865 (+/-0.015) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'sgd'}
0.866 (+/-0.015) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'constant', 'solver': 'adam'}
0.869 (+/-0.010) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.865 (+/-0.010) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (200,), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.862 (+/-0.011) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'sgd'}
0.862 (+/-0.008) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'constant', 'solver': 'adam'}
0.863 (+/-0.015) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.865 (+/-0.013) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 20), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.865 (+/-0.011) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'sgd'}
0.866 (+/-0.004) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'constant', 'solver': 'adam'}
0.862 (+/-0.005) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.865 (+/-0.011) for {'activation': 'relu', 'alpha': 0.05, 'hidden_layer_sizes': (10, 50, 10), 'learning_rate': 'adaptive', 'solver': 'adam'}
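As a side note, instead of retyping the winning hyper-parameter combination by hand (as done in the next cell), a fitted `GridSearchCV` object exposes it directly through `best_params_` and `best_estimator_`. A minimal runnable sketch on synthetic stand-in data (the names `grid`, `X_demo`, `y_demo` are illustrative, not from this notebook):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Tiny synthetic stand-in for the real X_train / y_train
rng = np.random.RandomState(0)
X_demo, y_demo = rng.rand(60, 4), rng.randint(0, 2, 60)

grid = GridSearchCV(MLPClassifier(max_iter=300),
                    {'hidden_layer_sizes': [(10,), (20,)]},
                    cv=3, scoring='accuracy')
grid.fit(X_demo, y_demo)

print(grid.best_params_)         # the winning hyper-parameter combination
best_clf = grid.best_estimator_  # already refit on all of X_demo / y_demo
```

Using `best_estimator_` avoids transcription errors and guarantees the evaluated model matches the grid-search winner.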
clf = MLPClassifier(activation='relu', alpha=0.0001, hidden_layer_sizes=(10, 50, 10), learning_rate='constant', solver='adam').fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
Accuracy: 0.795
Precision: 0.38549618320610685
Recall: 0.696551724137931
F1 Score: 0.4963144963144963
print("Classification Report for X_test data:")
print(classification_report(y_test, y_pred, zero_division=0))
correct_predictions = (y_pred == y_test)
num_correct = sum(correct_predictions)
print(f"Number of correct predictions: {num_correct}")
Classification Report for X_test data:
precision recall f1-score support
0 0.94 0.81 0.87 855
1 0.39 0.70 0.50 145
accuracy 0.80 1000
macro avg 0.66 0.75 0.68 1000
weighted avg 0.86 0.80 0.82 1000
Number of correct predictions: 795
NNC_conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(4, 4))
sns.heatmap(NNC_conf_matrix, annot=True, fmt='g', cmap='Blues',
square=True,
xticklabels=['No', 'Yes'],
yticklabels=['No', 'Yes'])
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix of NNC')
plt.show()
5- Evaluation and Main Conclusions¶
What is your summary of the achieved results? The primary objective of this work was to develop a robust predictive model capable of anticipating customer churn in the telecommunications sector. To tackle the problem, we tested six families of models to predict customer churn: nearest neighbours, a Bayesian classifier, decision trees, tree ensembles, support vector machines, and a neural network classifier. The results show that several of these models achieved high accuracy in predicting customer churn, with high precision and recall. Nonetheless, the best prediction model, whether trained on the balanced or the unbalanced dataset, was the Gradient Boosting tree ensemble: with an accuracy of 95% and an F1 score of 0.82, it presented the best overall performance and was the best at predicting churn.
models = [best_dt_model_f1_b, best_dt_model_f1, best_bag_b, bag, best_rf_b, rf, best_ada_b, adaboosted_tree, best_gbc_b, gbc_boosted_tree]
models_names = ['Decision Tree', 'Decision Tree', 'Bagging', 'Bagging', 'Random Forest', 'Random Forest', 'Ada Booster', 'Ada Booster', 'Gradient Boosting', 'Gradient Boosting']
data = ["Balanced", "Unbalanced", "Balanced", "Unbalanced", "Balanced", "Unbalanced", "Balanced", "Unbalanced", "Balanced", "Unbalanced"]
accuracies = []
recall_scores = []
precision_scores = []
f1_scores = []
for model in models:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
accuracies.append(accuracy)
recall_scores.append(recall)
precision_scores.append(precision)
f1_scores.append(f1)
comp = {'Model': models_names, 'Data': data, 'Accuracy': accuracies, 'Recall Score': recall_scores, 'Precision Score': precision_scores, 'F1 Score': f1_scores}
comparison = pd.DataFrame(comp)
comparison
| Model | Data | Accuracy | Recall Score | Precision Score | F1 Score | |
|---|---|---|---|---|---|---|
| 0 | Decision Tree | Balanced | 0.910 | 0.820690 | 0.650273 | 0.725610 |
| 1 | Decision Tree | Unbalanced | 0.941 | 0.668966 | 0.898148 | 0.766798 |
| 2 | Bagging | Balanced | 0.940 | 0.834483 | 0.770701 | 0.801325 |
| 3 | Bagging | Unbalanced | 0.949 | 0.731034 | 0.898305 | 0.806084 |
| 4 | Random Forest | Balanced | 0.938 | 0.827586 | 0.764331 | 0.794702 |
| 5 | Random Forest | Unbalanced | 0.953 | 0.751724 | 0.908333 | 0.822642 |
| 6 | Ada Booster | Balanced | 0.851 | 0.648276 | 0.489583 | 0.557864 |
| 7 | Ada Booster | Unbalanced | 0.874 | 0.365517 | 0.609195 | 0.456897 |
| 8 | Gradient Boosting | Balanced | 0.939 | 0.820690 | 0.772727 | 0.795987 |
| 9 | Gradient Boosting | Unbalanced | 0.953 | 0.751724 | 0.908333 | 0.822642 |
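The F1 column of the table above can also be compared visually. A small sketch, rebuilding a frame from the rounded values reported in the table rather than reusing the comparison DataFrame:

```python
import pandas as pd
import matplotlib.pyplot as plt

# F1 scores copied (rounded) from the ensemble comparison table above
f1 = pd.DataFrame({
    'Model': ['Decision Tree', 'Bagging', 'Random Forest', 'Ada Booster', 'Gradient Boosting'],
    'Balanced': [0.726, 0.801, 0.795, 0.558, 0.796],
    'Unbalanced': [0.767, 0.806, 0.823, 0.457, 0.823],
}).set_index('Model')

# One grouped bar per model, one colour per training-data variant
ax = f1.plot.bar(rot=30, figsize=(8, 4), title='F1 score by ensemble and training data')
ax.set_ylabel('F1 Score')
plt.tight_layout()
```

The grouped bars make it immediately visible that Random Forest and Gradient Boosting on unbalanced data lead on F1, while AdaBoost trails regardless of balancing.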
# Metrics reported earlier for each model, hardcoded from the corresponding runs
results = {
    ('KNN', 'Balanced'): (0.80, 0.80, 0.87, 0.82),
    ('KNN', 'Unbalanced'): (0.89, 0.42, 0.77, 0.54),
    ('NeuralNetwork', 'Balanced'): (0.80, 0.86, 0.80, 0.82),
    ('NeuralNetwork', 'Unbalanced'): (0.80, 0.80, 0.86, 0.82),
    ('Bayes', 'Balanced'): (0.78, 0.79, 0.78, 0.78),
    ('Bayes', 'Unbalanced'): (0.87, 0.87, 0.87, 0.87),
    ('SVMLinear', 'Balanced'): (0.80, 0.80, 0.80, 0.80),
    ('SVMLinear', 'Unbalanced'): (0.85, 0.85, 0.73, 0.79),
}
comparison = pd.DataFrame(
    [(model, data, *scores) for (model, data), scores in results.items()],
    columns=['Model', 'Data', 'Accuracy', 'Recall Score', 'Precision Score', 'F1 Score'])
comparison
| Model | Data | Accuracy | Recall Score | Precision Score | F1 Score | |
|---|---|---|---|---|---|---|
| 0 | KNN | Balanced | 0.80 | 0.80 | 0.87 | 0.82 |
| 1 | KNN | Unbalanced | 0.89 | 0.42 | 0.77 | 0.54 |
| 2 | NeuralNetwork | Balanced | 0.80 | 0.86 | 0.80 | 0.82 |
| 3 | NeuralNetwork | Unbalanced | 0.80 | 0.80 | 0.86 | 0.82 |
| 4 | Bayes | Balanced | 0.78 | 0.79 | 0.78 | 0.78 |
| 5 | Bayes | Unbalanced | 0.87 | 0.87 | 0.87 | 0.87 |
| 6 | SVMLinear | Balanced | 0.80 | 0.80 | 0.80 | 0.80 |
| 7 | SVMLinear | Unbalanced | 0.85 | 0.85 | 0.73 | 0.79 |
To see whether combining the predictions of the different models would improve the results, we created two hard-voting ensembles (one for balanced and one for unbalanced data), aggregating the predictions of the best-performing models. The resulting accuracy scores were very similar to those of the Gradient Boosting tree ensemble, our best model, but with lower F1 scores.
voting_model_balanced = VotingClassifier(estimators=[
('best_bag_b', best_bag_b),
('best_rf_b', best_rf_b),
('best_gbc_b', best_gbc_b)
], voting='hard')
voting_model_balanced.fit(X_train_balanced, y_train_balanced)
y_pred_balanced = voting_model_balanced.predict(X_test)
accuracy_balanced = accuracy_score(y_test, y_pred_balanced)
recall_balanced = recall_score(y_test, y_pred_balanced)
precision_balanced = precision_score(y_test, y_pred_balanced)
f1_balanced = f1_score(y_test, y_pred_balanced)
voting_model_unbalanced = VotingClassifier(estimators=[
('bag', bag),
('rf', rf),
('gbc_boosted_tree', gbc_boosted_tree)
], voting='hard')
voting_model_unbalanced.fit(X_train, y_train)
y_pred_unbalanced = voting_model_unbalanced.predict(X_test)
accuracy_unbalanced = accuracy_score(y_test, y_pred_unbalanced)
recall_unbalanced = recall_score(y_test, y_pred_unbalanced)
precision_unbalanced = precision_score(y_test, y_pred_unbalanced)
f1_unbalanced = f1_score(y_test, y_pred_unbalanced)
voting_comp = {
'Data': ['Balanced', 'Unbalanced'],
'Accuracy': [accuracy_balanced, accuracy_unbalanced],
'Recall Score': [recall_balanced, recall_unbalanced],
'Precision Score': [precision_balanced, precision_unbalanced],
'F1 Score': [f1_balanced, f1_unbalanced]
}
voting_comparison = pd.DataFrame(voting_comp)
display(voting_comparison)
| Data | Accuracy | Recall Score | Precision Score | F1 Score | |
|---|---|---|---|---|---|
| 0 | Balanced | 0.944 | 0.827586 | 0.794702 | 0.810811 |
| 1 | Unbalanced | 0.956 | 0.751724 | 0.931624 | 0.832061 |
To find out whether some characteristic of the data makes certain cases hard to classify, we identify the cases that are recurrently misclassified by most methods:
models = [best_dt_model_f1_b, best_dt_model_f1, best_bag_b, bag, best_rf_b, rf, best_ada_b, adaboosted_tree, best_gbc_b, gbc_boosted_tree]
predictions = pd.DataFrame({'True_Labels': y_test}, index=X_test.index)
for i, model in enumerate(models):
    # Suffix with the position so the balanced/unbalanced variants of the same
    # estimator class get distinct column names instead of overwriting each other
    model_name = f"{str(model).split('(')[0]}_{i}"
    predictions[model_name] = model.predict(X_test)
# Create a DataFrame flagging, per model, whether each case was misclassified
misclassified_cases = pd.DataFrame({'True_Labels': y_test}, index=X_test.index)
for column in predictions.columns.drop('True_Labels'):
    misclassified_cases[column] = (predictions['True_Labels'] != predictions[column]).astype(int)
misclassified_cases['Total_Misclassifications'] = misclassified_cases.iloc[:, 1:].sum(axis=1)
most_misclassified_cases = misclassified_cases[misclassified_cases['Total_Misclassifications'] > 2]
# Use boolean indexing to filter the corresponding rows of the test data
most_misclassified_features = X_test.loc[most_misclassified_cases.index]
miss_features = test_data.loc[most_misclassified_cases.index]
miss_features
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 57 | 1.0 | 121.0 | 0.0 | 1.0 | 30.0 | 198.4 | 129.0 | 33.730000 | 75.30 | 77.0 | 6.40 | 181.2 | 77.0 | 8.150 | 5.80 | 3.0 | 1.570000 | 3.0 |
| 951 | 0.0 | 101.0 | 0.0 | 0.0 | 0.0 | 153.8 | 89.0 | 26.150000 | 234.00 | 89.0 | 19.89 | 196.3 | 77.0 | 8.830 | 11.60 | 2.0 | 3.130000 | 4.0 |
| 2369 | 1.0 | 112.0 | 0.0 | 0.0 | 0.0 | 174.3 | 123.0 | 29.630000 | 191.38 | 124.0 | 11.92 | 215.4 | 89.0 | 9.690 | 9.00 | 6.0 | 2.430000 | 4.0 |
| 3256 | 0.0 | 115.0 | 0.0 | 0.0 | 0.0 | 268.0 | 115.0 | 45.560000 | 153.60 | 106.0 | 13.06 | 232.3 | 65.0 | 10.450 | 17.00 | 1.0 | 4.590000 | 3.0 |
| 1165 | 0.0 | 50.0 | 1.0 | 1.0 | 26.0 | 307.1 | 94.0 | 52.210000 | 289.40 | 78.0 | 24.60 | 174.9 | 109.0 | 7.870 | 8.00 | 3.0 | 2.160000 | 0.0 |
| 2731 | 1.0 | 127.0 | 0.0 | 0.0 | 0.0 | 245.2 | 91.0 | 41.680000 | 217.20 | 92.0 | 18.46 | 243.1 | 128.0 | 10.940 | 13.90 | 6.0 | 3.750000 | 0.0 |
| 2745 | 1.0 | 61.0 | 0.0 | 1.0 | 40.0 | 105.0 | 78.0 | 17.850000 | 180.60 | 100.0 | 15.35 | 174.1 | 115.0 | 7.830 | 10.20 | 2.0 | 2.750000 | 2.0 |
| 3604 | 1.0 | 60.0 | 0.0 | 0.0 | 0.0 | 221.0 | 94.0 | 37.570000 | 308.20 | 72.0 | 26.20 | 233.2 | 94.0 | 10.490 | 10.10 | 5.0 | 2.730000 | 1.0 |
| 3070 | 1.0 | 154.0 | 0.0 | 0.0 | 0.0 | 154.5 | 122.0 | 26.270000 | 214.20 | 71.0 | 18.21 | 178.0 | 105.0 | 8.010 | 12.00 | 2.0 | 3.240000 | 3.0 |
| 437 | 1.0 | 100.0 | 0.0 | 0.0 | 0.0 | 278.0 | 76.0 | 47.260000 | 176.70 | 74.0 | 15.02 | 219.5 | 126.0 | 9.880 | 8.30 | 4.0 | 2.240000 | 0.0 |
| 1965 | 1.0 | 139.0 | 0.0 | 0.0 | 0.0 | 236.6 | 109.0 | 40.220000 | 169.90 | 107.0 | 14.44 | 212.3 | 118.0 | 9.550 | 11.10 | 2.0 | 3.000000 | 1.0 |
| 3873 | 1.0 | 63.0 | 0.0 | 0.0 | 0.0 | 162.5 | 86.0 | 27.630000 | 165.40 | 112.0 | 14.06 | 163.7 | 67.0 | 7.370 | 13.40 | 6.0 | 3.620000 | 4.0 |
| 2961 | 1.0 | 98.0 | 0.0 | 1.0 | 36.0 | 168.0 | 81.0 | 28.560000 | 163.20 | 125.0 | 13.87 | 172.7 | 120.0 | 7.770 | 8.00 | 2.0 | 2.160000 | 6.0 |
| 1551 | 1.0 | 225.0 | 0.0 | 0.0 | 0.0 | 165.4 | 106.0 | 28.120000 | 273.70 | 109.0 | 23.26 | 210.0 | 93.0 | 9.450 | 8.70 | 3.0 | 2.350000 | 0.0 |
| 2377 | 1.0 | 101.0 | 0.0 | 1.0 | 36.0 | 123.7 | 125.0 | 21.030000 | 172.60 | 106.0 | 14.67 | 280.5 | 127.0 | 12.620 | 8.80 | 4.0 | 2.380000 | 1.0 |
| 4153 | 1.0 | 106.0 | 1.0 | 0.0 | 0.0 | 196.7 | 100.0 | 33.440000 | 210.20 | 109.0 | 17.87 | 173.8 | 83.0 | 7.820 | 7.98 | 1.0 | 2.110000 | 4.0 |
| 3509 | 0.0 | 59.0 | 0.0 | 1.0 | 26.0 | 205.4 | 68.0 | 34.920000 | 115.00 | 115.0 | 9.78 | 214.7 | 130.0 | 9.660 | 9.40 | 2.0 | 2.540000 | 5.0 |
| 2820 | 0.0 | 74.0 | 0.0 | 1.0 | 27.0 | 154.1 | 122.0 | 26.200000 | 195.30 | 150.0 | 16.60 | 276.7 | 86.0 | 12.450 | 13.20 | 2.0 | 3.560000 | 4.0 |
| 1018 | 1.0 | 76.0 | 0.0 | 0.0 | 0.0 | 263.4 | 148.0 | 37.387971 | 230.30 | 69.0 | 19.58 | 170.6 | 101.0 | 7.680 | 11.40 | 5.0 | 3.080000 | 1.0 |
| 1904 | 1.0 | 90.8 | 0.0 | 1.0 | 33.0 | 167.8 | 91.0 | 28.530000 | 205.30 | 91.0 | 17.45 | 130.0 | 132.0 | 5.850 | 14.50 | 4.0 | 3.920000 | 4.0 |
| 1192 | 1.0 | 88.0 | 1.0 | 0.0 | 0.0 | 235.1 | 98.0 | 39.970000 | 251.80 | 79.0 | 21.40 | 285.9 | 76.0 | 12.870 | 7.20 | 2.0 | 1.940000 | 4.0 |
| 4980 | 1.0 | 73.0 | 0.0 | 0.0 | 0.0 | 177.2 | 86.6 | 30.120000 | 270.50 | 84.0 | 22.99 | 241.8 | 112.0 | 10.880 | 12.30 | 2.0 | 3.320000 | 3.0 |
| 1105 | 1.0 | 135.0 | 0.0 | 1.0 | 28.0 | 201.4 | 100.0 | 34.240000 | 246.50 | 117.0 | 20.95 | 154.8 | 131.0 | 6.970 | 12.90 | 4.0 | 3.480000 | 2.0 |
| 4421 | 1.0 | 67.0 | 0.0 | 0.0 | 0.0 | 177.4 | 102.0 | 30.160000 | 215.20 | 123.0 | 18.29 | 132.4 | 120.0 | 5.960 | 6.60 | 4.0 | 1.780000 | 4.0 |
| 69 | 1.0 | 150.0 | 0.0 | 0.0 | 0.0 | 178.9 | 101.0 | 30.410000 | 169.10 | 110.0 | 14.37 | 148.6 | 100.0 | 6.690 | 13.80 | 3.0 | 3.730000 | 4.0 |
| 2724 | 1.0 | 182.0 | 0.0 | 0.0 | 0.0 | 279.1 | 124.0 | 47.450000 | 180.50 | 108.0 | 15.34 | 217.5 | 104.0 | 9.790 | 9.50 | 11.0 | 2.570000 | 2.0 |
| 736 | 1.0 | 68.0 | 0.0 | 0.0 | 0.0 | 162.1 | 86.0 | 27.560000 | 155.00 | 86.0 | 13.18 | 189.7 | 87.0 | 8.540 | 11.00 | 9.0 | 2.970000 | 5.0 |
| 1240 | 1.0 | 108.0 | 0.0 | 1.0 | 34.0 | 162.1 | 83.0 | 27.560000 | 171.80 | 117.0 | 14.60 | 259.8 | 76.0 | 11.690 | 9.60 | 3.0 | 2.590000 | 4.0 |
| 3991 | 1.0 | 123.0 | 0.0 | 0.0 | 0.0 | 132.2 | 122.0 | 22.470000 | 169.90 | 101.0 | 14.44 | 150.1 | 123.0 | 6.750 | 12.90 | 11.0 | 3.480000 | 3.0 |
| 2785 | 1.0 | 38.0 | 0.0 | 0.0 | 0.0 | 175.7 | 109.0 | 29.870000 | 211.80 | 97.0 | 18.00 | 137.9 | 109.0 | 6.210 | 9.20 | 3.0 | 2.624881 | 5.0 |
| 2950 | 1.0 | 133.0 | 2.0 | 0.0 | 0.0 | 117.8 | 100.0 | 20.030000 | 199.20 | 105.0 | 16.93 | 244.1 | 119.0 | 10.980 | 11.80 | 4.0 | 3.190000 | 0.0 |
| 4137 | 1.0 | 76.0 | 0.0 | 0.0 | 0.0 | 222.0 | 99.0 | 37.740000 | 211.22 | 122.0 | 18.35 | 204.1 | 131.0 | 9.180 | 8.20 | 5.0 | 2.210000 | 3.0 |
| 2934 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 149.0 | 73.0 | 25.330000 | 131.00 | 81.0 | 11.14 | 238.6 | 69.0 | 10.606 | 8.60 | 3.0 | 2.320000 | 2.0 |
| 4406 | 1.0 | 116.0 | 1.0 | 0.0 | 0.0 | 232.2 | 98.0 | 39.470000 | 244.70 | 91.0 | 20.80 | 151.3 | 69.0 | 6.810 | 10.60 | 1.0 | 2.860000 | 7.0 |
| 1115 | 1.0 | 98.0 | 0.0 | 0.0 | 0.0 | 165.0 | 129.0 | 28.050000 | 202.60 | 113.0 | 17.22 | 172.3 | 94.0 | 7.750 | 12.50 | 7.0 | 3.380000 | 1.0 |
| 3369 | 1.0 | 73.0 | 0.0 | 0.0 | 0.0 | 137.6 | 113.0 | 23.390000 | 179.70 | 75.0 | 15.27 | 172.8 | 71.0 | 7.780 | 10.70 | 2.0 | 2.890000 | 1.0 |
| 1262 | 0.0 | 157.0 | 0.0 | 0.0 | 0.0 | 220.7 | 105.0 | 37.520000 | 119.30 | 127.0 | 10.14 | 165.1 | 113.0 | 7.430 | 11.50 | 7.0 | 3.110000 | 4.0 |
| 4624 | 1.0 | 178.0 | 0.0 | 0.0 | 0.0 | 225.4 | 106.0 | 38.320000 | 110.70 | 90.0 | 9.41 | 134.1 | 109.0 | 6.030 | 15.30 | 5.0 | 4.130000 | 2.0 |
| 1532 | 1.0 | 103.0 | 0.0 | 1.0 | 18.0 | 149.9 | 84.0 | 25.480000 | 170.90 | 84.0 | 14.53 | 171.5 | 112.0 | 7.720 | 11.50 | 7.0 | 3.110000 | 0.0 |
| 3147 | 0.0 | 62.0 | 0.0 | 0.0 | 0.0 | 245.3 | 91.0 | 41.700000 | 122.90 | 130.0 | 10.45 | 228.4 | 102.0 | 10.280 | 8.50 | 4.0 | 2.300000 | 4.0 |
| 2472 | 1.0 | 129.0 | 0.0 | 0.0 | 0.0 | 137.8 | 120.0 | 23.430000 | 225.80 | 110.0 | 19.19 | 145.2 | 95.0 | 6.530 | 10.20 | 6.0 | 2.750000 | 1.0 |
| 4494 | 0.0 | 137.0 | 0.0 | 0.0 | 0.0 | 248.3 | 63.0 | 42.210000 | 250.40 | 108.0 | 21.28 | 159.1 | 94.0 | 7.160 | 9.60 | 4.0 | 2.590000 | 1.0 |
| 2238 | 1.0 | 61.0 | 0.0 | 0.0 | 0.0 | 267.1 | 104.0 | 45.410000 | 180.40 | 131.0 | 15.33 | 230.6 | 106.0 | 10.380 | 12.10 | 4.0 | 4.670000 | 1.0 |
| 974 | 0.0 | 81.4 | 0.0 | 1.0 | 31.0 | 135.9 | 90.0 | 23.100000 | 271.00 | 84.0 | 23.04 | 179.1 | 89.0 | 8.060 | 9.50 | 7.0 | 2.570000 | 6.0 |
| 3601 | 1.0 | 166.0 | 0.0 | 0.0 | 0.0 | 156.8 | 100.0 | 26.660000 | 171.50 | 93.0 | 14.58 | 149.1 | 97.0 | 6.710 | 11.20 | 10.0 | 3.020000 | 1.0 |
| 91 | 1.0 | 155.0 | 0.0 | 0.0 | 0.0 | 203.4 | 100.0 | 34.580000 | 190.90 | 104.0 | 16.23 | 196.0 | 119.0 | 8.820 | 8.90 | 4.0 | 2.400000 | 0.0 |
| 2535 | 1.0 | 125.0 | 0.0 | 0.0 | 0.0 | 113.0 | 108.0 | 19.210000 | 169.20 | 107.0 | 14.38 | 156.6 | 61.0 | 7.050 | 9.20 | 5.0 | 2.480000 | 2.0 |
| 1516 | 1.0 | 68.0 | 0.0 | 0.0 | 0.0 | 213.9 | 112.0 | 36.360000 | 260.50 | 100.0 | 22.14 | 233.8 | 97.0 | 10.520 | 8.40 | 3.0 | 2.270000 | 1.0 |
miss_features.describe()
| churn | accountlength | internationalplan | voicemailplan | numbervmailmessages | totaldayminutes | totaldaycalls | totaldaycharge | totaleveminutes | totalevecalls | totalevecharge | totalnightminutes | totalnightcalls | totalnightcharge | totalintlminutes | totalintlcalls | totalintlcharge | numbercustomerservicecalls | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.000000 | 48.00000 | 48.000000 | 48.000000 | 48.00000 | 48.000000 | 48.000000 | 48.000000 |
| mean | 0.812500 | 104.316667 | 0.125000 | 0.250000 | 7.604167 | 191.066667 | 101.116667 | 32.328083 | 195.739583 | 101.104167 | 16.555417 | 193.81875 | 100.562500 | 8.718875 | 10.53500 | 4.270833 | 2.875935 | 2.562500 |
| std | 0.394443 | 41.501394 | 0.392754 | 0.437595 | 13.614353 | 49.643278 | 17.621739 | 8.275923 | 49.482537 | 18.585329 | 4.263230 | 41.56264 | 20.267263 | 1.867009 | 2.34517 | 2.473172 | 0.683270 | 1.855484 |
| min | 0.000000 | 24.000000 | 0.000000 | 0.000000 | 0.000000 | 105.000000 | 63.000000 | 17.850000 | 75.300000 | 69.000000 | 6.400000 | 130.00000 | 61.000000 | 5.850000 | 5.80000 | 1.000000 | 1.570000 | 0.000000 |
| 25% | 1.000000 | 71.750000 | 0.000000 | 0.000000 | 0.000000 | 154.400000 | 89.750000 | 26.252500 | 169.725000 | 85.500000 | 14.377500 | 162.55000 | 86.750000 | 7.317500 | 8.77500 | 2.000000 | 2.372500 | 1.000000 |
| 50% | 1.000000 | 101.000000 | 0.000000 | 0.000000 | 0.000000 | 177.300000 | 100.000000 | 30.140000 | 191.140000 | 104.500000 | 15.790000 | 180.15000 | 101.500000 | 8.105000 | 10.20000 | 4.000000 | 2.750000 | 2.500000 |
| 75% | 1.000000 | 130.000000 | 0.000000 | 0.250000 | 4.500000 | 227.100000 | 112.250000 | 37.885000 | 226.925000 | 112.250000 | 19.287500 | 228.95000 | 118.250000 | 10.305000 | 12.02500 | 5.250000 | 3.260000 | 4.000000 |
| max | 1.000000 | 225.000000 | 2.000000 | 1.000000 | 40.000000 | 307.100000 | 148.000000 | 52.210000 | 308.200000 | 150.000000 | 26.200000 | 285.90000 | 132.000000 | 12.870000 | 17.00000 | 11.000000 | 4.670000 | 7.000000 |
variables = ['churn', 'internationalplan', 'voicemailplan']
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
titles = ['Churn', 'International Plan', 'Voicemail Plan']
for i, variable in enumerate(variables):
miss_features[variable].value_counts().plot(kind='bar', ax=ax[i])
ax[i].set_title(titles[i])
plt.tight_layout()
fig, ax = plt.subplots(3, 5, figsize=(15, 10))
variables = ['accountlength', 'numbervmailmessages', 'totaldayminutes', 'totaldaycalls', 'totaldaycharge',
'totaleveminutes', 'totalevecalls', 'totalevecharge', 'totalnightminutes', 'totalnightcalls',
'totalnightcharge', 'totalintlminutes', 'totalintlcalls', 'totalintlcharge', 'numbercustomerservicecalls']
ax = ax.flatten()
for i, variable in enumerate(variables):
miss_features[variable].plot.density(ax=ax[i], legend=True)
ax[i].set_title(variable)
plt.tight_layout()
plt.show()
From the variable distributions and descriptive statistics, the main difference relative to the original dataset is in the variable "churn": unlike the original distribution, the hard cases are dominated by the class churn = Yes. This could happen for two reasons: customers who are about to cancel their subscription may genuinely be harder to predict (more impulsive behaviour, etc.), or, since this is the minority class, the models have less data from which to extract knowledge and generalize robust predictions.
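A quick sanity check of this point, using only the counts reported above (48 hard cases with mean churn 0.8125 in the `describe()` output, versus a 145/1000 churn rate in the test set per the classification report), confirms that churners are heavily over-represented among the misclassified cases. A sketch with a one-sided binomial test:

```python
from scipy.stats import binomtest

n_hard = 48                          # misclassified cases (count from describe() above)
churn_hard = round(0.8125 * n_hard)  # 39 churners among them (mean churn above)
baseline = 145 / 1000                # churn rate in the full test set (report support)

# One-sided test: is the churn proportion among hard cases higher than the baseline?
res = binomtest(churn_hard, n_hard, baseline, alternative='greater')
print(f"churners among hard cases: {churn_hard}/{n_hard}, p = {res.pvalue:.2e}")
```

The tiny p-value shows the over-representation is far too large to be a sampling artifact, supporting the interpretation that churners are intrinsically harder for these models.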
Final Recommendations¶
The best prediction model, whether trained on the balanced or the unbalanced dataset, was the Gradient Boosting tree ensemble. With a high accuracy of 95% and an F1 score of 0.82, it has the best overall performance and is the best at predicting churn.
Overall, the machine-learning success criterion was achieved, since several of the created models predict customer churn with high accuracy, precision, and recall. The business success criterion can only be evaluated by using the model to implement changes that reduce costs through increased customer retention.